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Abstract 

The  linear  least  trimmed  squares  (LTS)  estimator  is  a  statistical  technique  for  fitting  a  linear 
model  to  a  set  of  points.  It  was  proposed  by  Rousseeuw  as  a  robust  alternative  to  the  classical 
least  squares  estimator.  Given  a  set  of  n  points  in  qje  objective  is  to  minimize  the  sum  of 
the  smallest  50%  squared  residuals  (or  more  generally  any  given  fraction).  There  exist  practical 
heuristics  for  computing  the  linear  LTS  estimator,  but  they  provide  no  guarantees  on  the  accuracy 
of  the  final  result.  Two  results  are  presented.  First,  a  measure  of  the  numerical  condition  of  a  set 
of  points  is  introduced.  Based  on  this  measure,  a  probabilistic  analysis  of  the  accuracy  of  the  best 
LTS  fit  resulting  from  a  set  of  random  elemental  fits  is  presented.  This  analysis  shows  that  as 
the  condition  of  the  point  set  improves,  the  accuracy  of  the  resulting  fit  also  increases.  Second, 
a  new  approximation  algorithm  for  LTS,  called  Adaptive-LTS,  is  described.  Given  bounds  on 
the  minimum  and  maximum  slope  coefficients,  this  algorithm  returns  an  approximation  to  the 
optimal  LTS  fit  whose  slope  coefficients  lie  within  the  given  bounds.  Empirical  evidence  of  this 
algorithm’s  efficiency  and  effectiveness  is  provided  for  a  variety  of  data  sets. 
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1.  Introduction 

Consider  an  n-element  point  set  P  -  {pi,...,p„),  where  pi  —  (Xj-j, . . . , x,;rf_i,y,)  e  In 
standard  linear  regression  with  intercept,  we  assume  the  model 

d-\ 

yi  ^  f^PjXij+Pd  +  Ci,  for/ = 

7=1 
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where  (x,;i, . . . ,  Xid-i)  are  the  given  independent  variables,  the  y,’s  are  the  given  dependent  vari¬ 
ables,  p  —  (ySi, . . .  ,jSj)  is  the  unknown  coefficient  vector,  and  the  e,’s  are  the  errors.  We  refer 
to  {fii,. . .  ,l3d-i)  as  the  slope  coefficients  and  fid  as  the  intercept.  Given  an  estimator  e 
define  the  ith  residual  to  be  r,(jS,  P)  —yi-{  Fijfl  ^ jXij  +  Pd)-  Let  P)  denote  the  /th  smallest 

residual  in  terms  of  absolute  value. 

Robust  estimators  (see,  e.g.,  [1])  have  been  introduced  in  order  to  eliminate  sensitivity  to 
outliers,  that  is,  points  that  fail  to  follow  the  linear  pattern  of  the  majority  of  the  points.  The 
basic  measure  of  the  robustness  of  an  estimator  is  its  breakdown  point,  that  is,  the  fraction  (up  to 
50%)  of  outlying  data  points  that  can  corrupt  the  estimator  arbitrarily.  One  of  the  most  widely 
studied  robust  linear  estimators  is  Rousseeuw’s  least  median  of  squares  estimator  (LMS)  [2], 
which  is  defined  to  be  the  estimator  that  minimizes  the  median  squared  residual.  More  generally, 
given  an  integer  coverage  value  h,  the  objective  is  to  find  the  hyperplane  that  minimizes  the  Mb 
smallest  squared  residual.  A  number  of  papers,  both  practical  and  theoretical,  have  been  devoted 
to  computing  this  estimator  in  the  plane  and  in  higher  dimensions  (see,  e.g.,  [3,  4,  5,  6,  7,  8,  9]). 

It  has  been  observed  by  Rousseeuw  and  Leroy  [1]  that  LMS  may  not  be  the  best  estimator 
from  the  perspective  of  statistical  properties.  They  argue  in  support  of  the  least  trimmed  squares 
(or  LTS)  linear  estimator  [2].  Given  an  n-element  point  set  P  and  a  coverage  h  <  n,it  is  defined  to 
be  the  estimator  that  minimizes  the  sum  (as  opposed  to  the  maximum)  of  the  h  smallest  squared 
residuals.  For  the  sake  of  preserving  scale,  we  convert  this  into  a  quantity  that  more  closely 
resembles  a  standard  deviation.  More  formally,  given  a  non-vertical  hyperplane  ji  define  its  LTS 
cost  with  respect  to  P  and  h  to  be 


A^(P,h) 


[h-l 


Note  that  this  measure  is  scale  equivariant,  and  minimizing  it  is  equivalent  to  minimizing  the  sum 
of  squared  residuals.  The  LTS  estimator  is  the  coefficient  vector  of  minimum  LTS  cost,  which 
we  denote  throughout  by  P*(P,  h).  Let  A*(P,  h)  denote  the  associated  LTS  cost.  When  P  and  h  are 
clear  from  context,  we  refer  to  these  simply  as  jS*  and  A*,  respectively.  The  LTS  problem  is  that 
of  computing  p*  from  P  and  h.  We  refer  to  the  points  having  the  h  smallest  squared  residuals  as 
inliers  and  the  remaining  points  as  outliers.  This  generalizes  the  ordinary  least  squares  estimator 
(when  h  =  n).  It  is  customary  to  set  h  -  [{n  +  d  +  1)/2J  for  outlier  detection  [1].  In  practice, 
h  may  be  set  to  some  constant  fraction  of  n  based  on  the  expected  number  of  outliers.  The 
statistical  properties  of  LTS  are  analyzed  in  [1,  2]. 

The  computational  complexity  of  LTS  is  less  well  understood  than  that  of  LMS.  Various 
exact  algorithms  have  been  presented,  which  are  based  on  variants  of  branch-and-bound  search 
and  efficient  incremental  updates  [10,  11,  12].  Unfortunately,  these  algorithms  are  practical  only 
for  fairly  small  point  sets.  Hossjer  [13]  presented  an  0{n^  logn)  algorithm  for  LTS  in  based 
on  plane  sweep.  In  a  companion  paper  [14]  we  presented  an  exact  algorithm  for  LTS  in 
which  runs  in  0(n^)  time.  We  also  presented  an  algorithm  for  for  any  d  >  3,  which  runs 
in  0(n‘^^^)  time.  (Throughout,  we  assume  that  d  is  a  fixed  constant.)  For  large  n  these  running 
times  may  be  unacceptably  high,  even  in  spaces  of  moderate  dimension. 

Given  these  relatively  high  running  times,  it  is  natural  to  consider  whether  this  problem  can 
be  solved  approximately.  There  are  a  few  possible  ways  to  formulate  LTS  as  an  approximation 
problem,  either  by  approximating  the  residual,  by  approximating  the  quantile,  or  both.  The 
following  formulations  were  introduced  in  [14].  The  approximation  parameters  Sr  and  Sq  denote 
the  allowed  residual  and  quantile  errors,  respectively. 
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Residual  Approximation:  The  requirement  of  minimizing  the  sum  of  squared  residuals  is  re¬ 
laxed.  Given  0  <  Sr,  an  Sr-residual  approximation  is  any  hyperplane  p  such  that 


^p{P,h)  <  (1 +e,)A*(P,/i). 

Quantile  Approximation:  Much  of  the  complexity  of  LTS  arises  because  of  the  requirement 
that  exactly  h  points  be  covered.  We  can  relax  this  requirement  by  introducing  a  parameter 
0  <  Eg  <  hjn  and  requiring  that  the  fraction  of  inliers  used  is  smaller  by  Eq.  Let  — 
h  -  \  nEq\.  An  Eq-quantile  approximation  is  any  hyperplane  p  such  that 

A/j(P,h-)  <  A*(P,h). 

Hybrid  Approximation:  The  above  approximations  can  be  merged  into  a  single  approximation. 
Given  Er  and  Eq  as  in  the  previous  two  approximations,  let  be  as  defined  above.  An 
(Er,  Eq)-hybrid  approximation  is  any  hyperplane  p  such  that 

Aii{P,h-)  <  (1  -He,)A*(P,/i). 

Note  that  approximating  the  LTS  cost  does  not  imply  that  the  optimum  slope  coefficients 
themselves  are  well  approximated.  Computing  an  approximation  to  the  slope  coefficients  seems 
to  be  difficult.  In  particular,  for  some  pathological  point  sets  there  may  be  many  solutions  that 
have  nearly  the  same  LTS  costs  but  very  different  slope  coefficients.  Consider,  for  example, 
fitting  a  plane  to  a  set  of  points  uniformly  distributed  within  a  sphere.  In  an  earlier  paper  [14],  we 
presented  an  approximation  algorithm  for  LTS  whose  execution  time  is  roughly  0(n‘^lh).  In  the 
same  paper,  we  presented  asymptotic  lower  bounds  for  computing  the  LTS  and  the  related  LTA 
(least  trimmed  absolute  value)  estimators.  These  results  suggest  that  substantial  improvements 
to  these  worst-case  complexity  bounds,  either  exact  or  approximate,  are  unlikely. 

There  are,  however,  simple  and  practical  heuristic  algorithms  for  LTS.  One  is  the  Fast-LTS 
heuristic  of  Rousseeuw  and  Van  Driessen  [15].  This  algorithm  is  based  on  a  combination  of 
random  sampling  and  local  improvement.  It  involves  repeatedly  fitting  a  hyperplane  to  a  small 
random  sample  of  points,  which  is  called  an  elemental  fit,  and  then  applying  a  small  number  of 
local  improvements,  called  concentration  steps  or  C-steps.  (The  algorithm  will  be  presented  in 
Section  2.)  Each  C-step  transforms  an  existing  fit  into  one  of  equal  or  lower  LTS  cost.  In  practice 
Fast-LTS  performs  quite  well.  However,  it  provides  no  assurance  on  the  quality  of  the  final  fit. 
A  more  efficient  implementation  of  this  general  approach  was  presented  recently  by  Torti  et  al. 
[16].  Another  example  of  a  local  search  heuristic  is  Hawkins’  feasible  point  algorithm  [17].  It 
does  not  offer  any  formal  performance  guarantees  either. 

While  Fast-LTS  works  well  in  practice,  it  does  not  represent  a  truly  satisfactory  algorithmic 
solution  to  the  LTS  problem.  It  implicitly  assumes  that  the  data  are  sufficiently  well  conditioned 
such  that,  with  reasonably  high  probability,  a  random  elemental  fit  is  close  enough  to  the  optimal 
fit  such  that  subsequent  C-steps  will  converge  to  a  fit  that  is  very  nearly  optimal.  If  the  data  are 
poorly  conditioned,  however,  Fast-LTS  may  require  a  very  large  number  of  random  trials  before 
it  succeeds.  This  raises  the  question  of  what  constitutes  a  well-conditioned  point  set  and  whether 
it  is  possible  to  derive  an  algorithm  that  provides  results  of  guaranteed  accuracy  on  all  input  sets. 

We  present  two  principal  results  in  this  paper.  First,  in  Section  2  we  introduce  a  measure 
of  the  numerical  condition  of  a  set  of  points  and  present  a  probabilistic  analysis  of  the  accuracy 
of  the  LTS  fit  resulting  from  a  series  of  random  elemental  fits.  Our  results  show  that  as  the 
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condition  of  the  point  set  improves,  the  probability  of  obtaining  an  accurate  fit  improves  as  well. 
Second,  in  Section  3  we  present  an  algorithm  that  produces  outputs  of  guaranteed  approximate 
accuracy.  Specifically,  we  present  a  hybrid  approximation  algorithm  for  LTS,  called  Adaptive- 
LTS.  The  input  to  the  algorithm  consists  of  P,  h,  bounds  on  the  minimum  and  maximum  slope 
coefficients,  and  approximation  parameters  Sr  and  Sq.  It  computes  a  hyperplane  that  is  an  (e^,  e^)- 
hybrid  approximation  to  the  optimal  hyperplane  whose  slope  coefficients  lie  within  the  specified 
bounds.  The  most  complex  aspect  of  the  algorithm  involves  the  solution  of  a  variant  of  the  1- 
dimensional  LTS  problem  for  a  set  of  input  intervals,  which  we  call  the  interval  LTS  problem. 
In  Section  4  we  present  an  0(n  log  n)  time  solution  to  this  problem.  We  have  implemented  the 
Adaptive-LTS  algorithm,  and  in  Section  5,  we  present  empirical  studies  contrasting  these  two 
algorithms. 

2.  Numerical  Condition  and  Elemental  Fits 

In  this  section  we  consider  the  question  of  how  the  numerical  conditioning  of  a  point  set 
influences  the  accuracy  of  an  LTS  fit  based  on  random  elemental  fits.  Consider  a  set  of  n  points 
P  in  M"'  and  a  coverage  value  h  <  n.  We  assume  that  the  points  are  in  general  position,  so 
that  any  subset  of  d  points  defines  a  unique  {d  —  l)-dimensional  hyperplane.  This  can  always 
be  achieved  through  an  infinitesimal  perturbation.  Any  hyperplane  generated  in  this  manner  is 
called  an  elemental  fit.  Elemental  fits  play  an  important  role  in  Fast-LTS  (described  below).  In 
this  section  we  introduce  a  measure  on  the  numerical  condition  of  a  point  set  and  establish  formal 
bounds  on  the  accuracy  of  a  random  elemental  fit  in  terms  of  this  measure. 

We  begin  with  a  brief  presentation  of  Rousseeuw  and  van  Driessen’s  Fast-LTS  algorithm. 
(Our  description  is  slightly  simplified,  and  we  refer  the  reader  to  [15]  for  details.)  The  algorithm 
consists  of  a  series  of  stages.  Each  stage  starts  by  generating  a  random  subset  of  P  of  size  d  and 
fitting  a  hyperplane  p'  to  these  points.  It  then  generates  a  series  of  local  improvements  as  follows. 
The  h  points  of  P  that  have  the  smallest  squared  residuals  with  respect  to  p'  are  determined.  Let 
P"  be  the  least  squares  linear  estimator  of  these  points.  Clearly,  the  LTS  cost  of  p”  is  not  greater 
than  that  of  p' .  We  set  p’  <—  p”  and  repeat  the  process.  This  is  called  a  C-step.  C-steps  are 
repeated  a  small  fixed  number  of  times. 

The  overall  algorithm  operates  by  performing  a  large  number  of  stages,  where  each  stage 
starts  with  a  new  random  elemental  fit.  After  all  stages  have  completed,  a  small  number  of 
results  with  the  lowest  LTS  costs  are  saved.  For  each  of  these,  C-steps  are  repeatedly  applied 
until  convergence.  (It  is  easy  to  show  that  the  result  is  at  a  local  minimum  of  the  LTS  cost 
function.)  The  algorithm’s  final  output  is  the  hyperplane  with  the  lowest  overall  cost.  Assuming 
constant  dimension,  each  C-step  can  be  performed  in  0{n)  time,  and  thus  the  overall  running 
time  is  0(kn),  where  k  is  the  number  of  stages. 

It  is  easy  to  see  why  Fast-LTS  performs  well  under  nominal  circumstances.  Suppose  that  we 
succeed  in  sampling  d  inliers,  which,  assuming  at  least  h  inliers,  would  be  expected  to  occur  with 
probability  roughly  (h/nY.  Suppose  as  well  that  these  points  are  well  conditioned  in  the  sense 
that  they  define  a  unique  hyperplane  even  considering  floating-point  round-off  errors.  Then  the 
resulting  elemental  fit  should  be  sufficiently  close  to  the  optimum  that  the  subsequent  C-steps 
will  converge  to  the  true  optimum.  Thus,  although  there  are  no  guarantees  of  good  performance, 
if  the  data  are  well  conditioned,  then  after  a  sufficiently  large  number  of  random  starts,  it  is 
reasonable  to  believe  that  at  least  one  of  the  resulting  fits  will  lead  to  the  true  optimum. 

The  objective  of  this  section  is  explore  how  we  might  formalize  the  notion  of  “well  condi¬ 
tioned’’  data.  We  will  ignore  the  effect  of  C-steps,  and  focus  only  on  analyzing  the  performance 
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of  repeated  random  elemental  fits.  (An  analysis  of  the  effect  of  C-steps  would  be  a  valuable 
addition  to  our  results,  but  this  seems  to  be  a  significantly  harder  problem.)  Our  aim  is  to  provide 
sufficient  conditions  on  the  properties  of  point  sets  so  that  with  constant  probability,  a  random 
elemental  fit  will  provide  a  reasonably  good  approximation  to  the  optimum  LTS  hyperplane. 

Rousseeuw  and  Leroy  provide  an  analysis  of  the  number  of  elemental  fits  needed  to  obtain 
a  good  fit  with  a  given  probability  (see  Chapter  5  of  [1]),  but  their  analysis  makes  the  tacit 
assumption  that  the  inkers  lie  on  the  optimum  LTS  fit  and  thus  determine  a  unique  hyperplane. 
Such  an  argument  ignores  the  potential  effects  of  numerical  instabilities.  For  example,  if  the 
inliers  are  clustered  close  to  a  low-dimensional  fiat,^  then  the  resulting  elemental  fit  will  be 
numerically  unstable. 

Fix  P  and  h  throughout  the  remainder  of  this  section,  and  let  =  jS*(P,  h)  denote  the  opti¬ 
mum  LTS  estimator.  Define  jt,  to  be  the  t/-element  column  vector  (x,  i, . . . ,  1)  and p  to  be 

the  row  vector  ()6j, . . .  ,jSj).  The  ith  residual  is  just  -x,.  Let  £(P)  denote  the  collection  of  all 
subsets  of  cardinality  d  of  the  index  set  {1, ... ,  n).  Given  E  =  {/i, . . . ,  L)  e  £,  let  Xe  denote 
the  d  X  d  matrix  whose  columns  are  the  coordinate  vectors  of  the  corresponding  independent 
variables: 


Let  yE  be  the  cZ-element  row  vector  of  associated  dependent  variables  (y,j , . . . ,  y,^).  By  our  general 
position  assumption,  these  vectors  are  linearly  independent,  and  so  there  is  a  unique  t/-element 
row  vector  p'  such  that 


yE  =  that  is  P'  = 

(see  Fig.  1).  For  our  subsequent  analysis,  we  also  define  y*^  =  P*Xe  to  be  the  vector  of  y-values 
resulting  from  evaluating  the  optimum  LTS  hyperplane  at  the  points  of  the  elemental  set.  Thus, 
we  have  p*  - 


• 

'  ,  ^  • 

yj 

• 

1  yj . *"•" 

.yj 

VE 

=  [yivVn) 

ve 

=  {ytvy*^) 

Figure  1:  Estimated  and  optimal  fits  for  E  =  and  =  0^/j  ,y/2)-  Solid  points  indicate  points  of  P  and  hollow 

points  are  computed  quantities. 

We  begin  with  a  few  standard  definitions  [19,  20].  Consider  a  ^i-vector  x  and  didxd  matrix 
X.  Let  Xij  denote  the  element  in  row  i  and  column  j  of  X,  and  let  Xj  denote  the  jih  column 


^This  arises,  for  example,  when  fitting  a  plane  to  a  set  of  points  in  ^/-dimensional  space  that  are  clustered  very  close 
to  a  /:-dimensional  flat,  where  0  <  k  <  d  -  2.  One  way  to  create  such  a  set  in  dimension  d  is  to  select  such  an  integer 
k,  generate  vectors  whose  first  k  coordinates  are  uniform  over  [0, 1]  and  whose  remaining  d  -  k  coordinates  are  drawn 
from  a  Gaussian  with  a  very  small  standard  deviation,  and  then  reorient  the  flat  by  applying  a  random  rotation  from  the 
orthogonal  group  SO(i/)  [18]. 
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vector.  The  norms  of  x  and  X  (in  the  L2  and  Frobenius  norms,  respectively)  are  defined  to  be 


ll^ll  = 


and  ||X||  = 


1/2 


Let  a:(X)  denote  the  condition  number  of  X  relative  to  this  norm,  that  is,  k{X)  -  Ill'll  •  ||Z'"^||. 
The  concept  of  condition  number  is  well  known  in  matrix  algebra  as  a  means  to  characterize  the 
stability  of  linear  systems  involving  X  [20].  It  is  a  dimensionless  quantity,  where  lower  values 
mean  greater  stability. 

Our  condition  measure  is  based  on  two  properties  of  the  independent  variables  of  the  inkers. 
Let  H  c  {!,...,«)  denote  the  /z-element  index  set  of  the  inliers  relative  to  P’s  optimal  LTS 
hyperplane.  Intuitively,  the  first  property  states  that  a  random  elemental  set  of  inliers  provides 
a  stable  elemental  fit.  This  is  expressed  by  bounding  the  median  condition  number  of  Xe  for 
E  e  &(H).  That  is,  we  are  taking  the  median  over  the  collection  of  all  c/-element  subsets  of  the 
h  inliers.  This  property  alone  is  not  sufficient,  since  even  a  minuscule  error  in  the  placement 
of  the  hyperplane  might  be  fatal  if  any  inlier  is  located  so  far  away  that  its  associated  residual 
dominates  the  overall  cost.  Such  a  point  is  called  a  leverage  point  [21].  Leverage  points  are 
problematic  if  they  are  few  in  number  (and  hence,  have  a  low  probability  of  being  sampled) 
and  they  do  not  agree  with  the  linear  model  determined  by  the  remaining  inliers  (and  hence, 
almost  all  elemental  fits  miss  the  leverage  points).  We  will  adopt  a  conservative  approach  of 
considering  all  leverage  points  to  be  harmful.  The  second  property  rules  out  leverage  point 
inliers  by  bounding  the  ratio  between  the  average  squared  norm  of  the  points  Xi,  for  i  e  H  (which 
is  sensitive  to  leverage  points)  and  the  median  squared  norm  of  these  same  points  (which  is  not 
sensitive).  More  formally,  given  two  positive  real  parameters  y  and  A,  we  say  that  P  is  (y,  A)-well 
conditioned  if 


(i):  med  k{Xe)  <  y  and 

Ee6(H) 


(ii): 


LjenWXjW^ 

h  ■  m&AjenWxjW'^ 


<  A\ 


(1) 


Note  that  these  assumptions  constrain  only  the  independent  variables  (not  the  dependent  vari¬ 
ables),  and  they  apply  only  to  the  set  of  inliers  of  the  optimal  fit  (and  thus  there  are  no  assump¬ 
tions  imposed  on  the  outliers).  The  use  of  the  median  in  the  two  assumptions  is  not  critical.  At 
the  expense  of  a  constant  factor  in  the  analysis,  we  could  have  replaced  the  median  with  any 
constant  quantile  of  h. 

The  main  result  of  this  section  is  given  in  the  following  theorem.  It  shows  that  depending  on 
the  condition  of  the  point  set,  after  a  sufficiently  large  number  of  random  elemental  fits,  we  can 
achieve  a  constant-factor  (residual)  approximation  for  LTS  with  arbitrarily  high  confidence. 

Theorem  1.  Let  P  be  an  n-element  point  set  in  that  is  (y,  A)-well  conditioned  for  some  y  and 
A.  Let  h  be  the  coverage,  and  let  q  —  hjn.  Then  for  any  0  <  d  <  1,  with  probability  at  least  1  —  d 
at  least  one  out  of4(2fqY  ln(l/(5)  random  elemental  fits,  denoted  0 ,  satisfies 


Afi’(P,h)  <  (1  -H  02yA)X{P,h). 


Assuming  that  the  dimension  is  a  fixed  constant,  observe  the  number  of  random  elemental 
fits  grows  on  the  order  of  0(q^‘^  ln(l/<5)),  which  is  very  sensitive  to  the  dimension  and  relatively 
insensitive  to  the  failure  probability.  The  proof  relies  on  the  following  lemma,  which  shows  that 
with  constant  probability,  a  single  elemental  fit  provides  the  desired  approximation  bound. 
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Lemma  1.  Given  the  preconditions  of  Theorem  1,  let  E  be  a  random  element  of&(P),  and  let  p' 
denote  the  associated  random  elemental  fit.  With  probability  at  least  {qjTY  jA, 

Afi'(P,h)  <  (1  +  Y2yA)A*(P,h). 


Assuming  for  now  that  this  lemma  holds,  we  obtain  Theorem  1  as  follows.  By  repeating  m 
independent  random  elemental  fits,  the  probability  of  failing  to  achieve  the  stated  approximation 
bound  is  at  most  (1  -  (^/2)'^/4)'”  <  exp(-m(^/2)'^/4).  By  selecting  m  >  4-(2/qY\n(l/6),  the 
probability  of  failure  is  at  most  <5,  from  which  Theorem  1  follows. 

The  remainder  of  the  section  is  devoted  to  proving  Lemma  1 . 

With  probability  at  least  q,  a  random  index  from  {!,...,«)  is  drawn  from  H,  and  so  the 
probability  that  a  random  element  of  S(P)  lies  in  S(H)  approaches  q^  for  all  sufficiently  large  n. 
(Note  that  we  sample  without  replacement.)  The  remainder  of  the  proof  is  conditioned  on  this 
assumption. 

Recall  that  p*  is  the  optimal  LTS  hyperplane,  and  A*  is  its  cost.  Let  A'  =  Ap'{P,h).  Let  PI 
denote  the  set  of  indices  of  the  h  points  of  P  whose  residuals  with  respect  to  p*  are  the  smallest. 
We  make  use  of  the  following  technical  result,  which  is  proved  in  Appendix  A. 


Lemma  2.  Given  A*  and  A'  as  defined  above, 

l|2  ^l/2 


A' 

A* 


1  < 


■  (IK‘II-II^£||)  ■ 


i:jeH\\XjY 


Given  this  result,  let  us  analyze  each  of  these  three  factors  separately.  By  definition,  the 
middle  factor  is  just  k{Xe).  To  analyze  the  first  factor,  observe  that  lly^  -  =  Tjjesiyj  ~  y*jY- 

By  our  earlier  assumption,  £  is  a  random  subset  of  H,  and  so  with  probability  approaching  1  /2'^ 
for  large  n,  every  element  of  {(yj  -  y*Y  :  j  e  E]  is  at  most  medy£//(y^'  -  y*Y.  Since  at  least  half 
of  the  elements  of  the  multiset  {(yj  -  y*Y  :  j  e  H}  are  at  least  as  large  as  the  median  of  the  set,  it 
follows  that  Yjjeniyj  ~  ^  (h  12)  med j^jfiyj  -  y*j)^-  Thus,  with  probability  approaching  1/2^^ 

(and  under  the  assumption  that  E  c  p{)  we  have 


lb'is-3'glP  medjeniyj-y*/  ^  2 

ZjeH(yj-y'jY  ~  (h/2)medjeH(yj-y*)^  ~  h' 

To  analyze  the  third  factor,  recall  that  from  the  definition  of  the  Frobenius  norm,  \\XeY  — 
TjjeE  WxjY-  By  our  earlier  assumption,  £  is  a  random  subset  of  H,  and  so  with  probability  at 
least  1  -  (1/2)^^  >  1  /2,  at  least  one  element  of  {HJtjjp  :  y  e  £)  is  at  least  as  large  as  medj^H  \\XjY. 
If  so,  IIX^lp  is  at  least  as  large  as  the  median  as  well.  Combining  this  with  property  (ii)  of 
well-conditioned  sets,  with  probability  at  least  1  /2  (and  under  the  assumption  that  E  Q  H)  we 
have 

Y.jenWXjY  A?  h-medjenWxjY  ,2, 

IIAfilP  medjenWxjY 

Combining  our  bounds  on  all  three  factors  together  with  condition  (i)  in  the  definition  of  well- 
conditioned  sets,  we  conclude  that  for  large  n,  with  probability  approaching  ^'^(l/2‘^)(l/2)  = 
(^/2)"'/2,  we  have 


Since  £  is  a  random  element  of  &,H),  with  probability  at  least  1/2,  k{Xe)  is  not  greater  than 
med/re£(/f)  K{Xp),  and  so  by  condition  (i)  of  well-conditioned  sets  we  have  k{Xe)  <  y.  Therefore, 
with  probability  at  least  {qjlY jA, 

^  <  1  -H  V2  ■  k{Xe)  ■  T  <  1  -I-  V2y/l. 

A* 

This  completes  the  proof  of  Lemma  1  and  establishes  Theorem  1 . 

Note  that  the  definition  of  well  conditioning  would  appear  to  require  the  computation  of 
medians  over  sets  of  size  O(h^),  which  is  quite  large.  It  is  clear  from  the  proof,  however,  that 
replacing  the  medians  with  the  rth  quantile  in  the  definition  of  well  conditioning  merely  changes 
the  factor  1/2“'^^  of  Lemma  1  to  Thus,  rather  than  computing  the  median  exactly,  it 

would  suffice  to  employ  random  sampling  to  estimate  its  value,  at  the  price  of  slightly  increasing 
the  number  of  elemental  fits. 

Of  course.  Theorem  1  does  not  completely  resolve  the  question  of  how  the  numerical  con¬ 
ditioning  of  the  point  set  affects  the  convergence  of  Fast-LTS.  First,  it  ignores  the  effect  of  C- 
steps.  Second,  since  it  applies  only  to  the  inliers,  whose  elements  we  do  not  normally  know, 
it  is  not  obvious  how  to  turn  this  into  an  effective  computational  procedure.  Nonetheless,  if  a 
priori  knowledge  about  the  inlier  distribution  is  available,  or  if  the  outliers  can  be  detected  and 
removed  a  posteriori,  this  result  can  provide  useful  bounds  on  the  expected  accuracy  of  any  ap¬ 
proach  to  LTS  based  on  repeated  random  elemental  fits  in  terms  of  easily  computable  quantities. 
In  Section  5.3  we  will  present  some  experimental  analysis  of  these  bounds. 

3.  An  Adaptive  Approximation  Algorithm 

In  this  section  we  present  an  approximation  algorithm  for  LTS,  which  we  call  Adaptive-LTS. 
The  input  to  the  algorithm  is  an  n-element  point  set  P  in  M"',  a  coverage  value  h  <  n,  a  residual 
approximation  parameter  e,.  >  0,  and  a  quantile  approximation  parameter  where  0  <  < 

hin.  The  algorithm  also  accepts  bounds  on  the  minimum  and  maximum  slope  coefficients.  If 
these  are  not  given,  the  algorithm  automatically  estimates  them  from  the  point  set.  The  algorithm 
returns  a  hyperplane  that  is  an  {er,  e^)-hybrid  approximation  to  the  optimal  hyperplane  whose 
slope  coefficients  lie  within  the  specified  slope  bounds.  Let  -  h  —  \nsq\  denote  the  reduced 
coverage  value,  as  given  in  the  quantile-approximation  definition  from  Section  1.  Recall  that  we 
assume  that  the  dimension  c/  is  a  fixed  constant. 

This  algorithm  is  based  on  a  general  technique  called  geometric  branch-and-bound,  which 
involves  a  recursive  subdivision  of  the  solution  space  in  search  of  the  optimal  solution.  Geometric 
branch-and-bound  has  been  used  in  other  applications  of  geometric  optimization,  notably  point 
pattern  matching  by  Huttenlocher  and  Rucklidge  [22,  23,  24]  and  Hagedoorn  and  Veltkamp  [25] 
and  image  registration  by  Mount  et  al.  [26]. 

Before  presenting  the  algorithm  we  provide  a  brief  overview  of  the  approach.  Recall  that 
the  solution  to  the  LTS  problem  can  be  represented  as  a  c/-dimensional  vector  =  (y6i, . . .  ,[5d), 
consisting  of  the  coefficients  of  the  estimated  hyperplane.  By  fixing  P  and  h,  each  point  fi  Is 
associated  with  its  LTS  cost,  Xp  -  Ap(P,  h).  This  defines  a  scalar  field  over  called  the  solution 
space.  The  LTS  problem  reduces  to  computing  the  point  p  that  minimizes  this  cost,  that  is, 
the  lowest  point  in  this  field.  This  is  the  context  in  which  geometric  branch-and-bound  is  applied. 

Let  us  begin  with  a  generic  overview  of  how  geometric  branch-and-bound  works.  It  recur¬ 
sively  subdivides  the  solution  space  into  a  collection  of  disjoint  regions,  called  cells.  The  initial 


cell  covers  any  subset  of  the  solution  space  that  is  known  to  contain  the  optimum  solution.  For  a 
given  cell  C,  let  denote  the  smallest  value  of  Ayj,  for  any  p  e  C.  For  the  LTS  problem,  this  is 
the  minimum  LTS  cost  under  the  constraint  that  the  coefficient  vector  of  the  hyperplane  is  chosen 
from  C.  The  algorithm  then  computes  a  lower  bound  and  an  upper  bound  on  Ac-  Letting  A^  and 
Ap  denote  these  bounds,  respectively,  we  have  A^;  <  Ac  <  A^.  (Note  that  AJ  is  an  upper  bound 
on  the  minimum  cost  within  the  cell,  not  an  upper  bound  on  all  the  costs  within  the  cell.)  The  al¬ 
gorithm  also  maintains  the  hyperplane  p’  associated  with  the  smallest  upper  bound  encountered 
in  any  cell  that  has  been  seen  so  far.  If  C’s  lower  bound  exceeds  the  smallest  upper  bound  seen  so 
far,  that  is,  if  A^  >  Ayj',  we  may  safely  eliminate  C  from  further  consideration,  since  any  solution 
that  it  might  provide  cannot  improve  upon  the  current  best  solution.  (The  coefficient  vector  p' 
and  Ayj'  will  be  defined  below  in  the  section  on  Killing  cells)  If  so,  we  say  that  C  is  killed  when 
this  happens.  Otherwise,  C  is  said  to  be  active. 

A  geometric  branch-and-bound  algorithm  runs  in  a  sequence  of  stages.  At  the  start  of  any 
stage,  the  algorithm  maintains  a  collection  of  all  the  currently  active  cells.  During  each  stage,  one 
of  the  active  cells  is  selected  and  is  processed  by  subdividing  it  into  two  or  more  smaller  cells, 
called  its  children.  We  compute  the  aforementioned  upper  and  lower  bounds  for  each  child,  and 
if  possible,  we  kill  them.  Each  surviving  child  cell  is  then  added  to  the  collection  of  active  cells 
for  future  processing.  The  algorithm  terminates  when  no  active  cells  remain. 

The  precise  implementation  of  any  geometric  branch-and-bound  algorithm  depends  on  the 
following  elements: 

•  How  are  cells  represented? 

•  What  is  the  initial  cell? 

•  How  is  a  cell  subdivided? 

•  How  are  the  upper  and  lower  bounds  computed? 

•  Under  what  conditions  is  a  cell  killed? 

•  Among  the  active  cells,  which  cell  is  selected  next  for  processing? 

Let  us  describe  each  of  these  elements  in  greater  detail  in  the  context  of  our  approximation 
algorithm  for  LTS. 

Cells  and  their  representations.  Although  a  solution  to  LTS  is  fully  specified  by  the  ^/-vector 
of  coefficients,  we  will  use  a  more  concise  representation.  Recall  that  a  hyperplane  is  repre¬ 
sented  by  the  equation  y  -  Xj=i  PjXj  +  Pd,  where  P4  is  the  intercept  term.  Rather  than  store 
all  ^-coefficients,  we  will  store  only  the  d  -  \  slope  coefficients  P  -  {P\,. . .  ,Pd-\)  G  We 

will  show  in  Section  4  that  given  such  a  vector,  it  is  possible  to  compute  the  optimum  value  of 
Pd  in  0{n  log  n)  time  by  solving  a  1  -dimensional  LTS  problem.  We  shall  show  below  that  the 
total  processing  time  for  a  cell  is  0{n  log  n),  and  so  this  does  not  affect  the  asymptotic  time  com¬ 
plexity.  Because  the  number  of  cells  that  need  to  be  processed  tends  to  grow  exponentially  with 
dimension  (see  Section  5.2),  this  reduction  in  dimension  can  significantly  reduce  the  algorithm’s 
overall  running  time.  Given  P  e  IR"'  \  let  Ap{P,  h)  denote  the  minimum  LTS  cost  achievable  by 
extending  to  a  cZ-dimensional  vector. 

For  our  algorithm,  a  cell  C  is  a  closed,  axis-parallel  hyperrectangle  in  *.  It  is  repre¬ 
sented  by  a  pair  of  vectors  P  ,P^  e  ',  and  consists  of  the  Cartesian  product  of  intervals 
nf=/  \P] ,P^]-  Let  Ac(F’,  h)  -  min^ggc  h)  denote  the  smallest  LTS  cost  of  any  point  of  C. 
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Sampled  elemental  fits.  We  will  employ  random  elemental  fits  to  help  guide  the  search  algo¬ 
rithm.  Before  the  algorithm  begins,  a  user-specified  number  of  randomly  sampled  elemental  fits 
is  generated.  Each  elemental  fit  is  associated  with  a  (t/  -  l)-dimensional  vector  of  slope  coeffi¬ 
cients.  Let  So  denote  this  set  of  vectors.  For  each  active  cell  C,  let  S  (C)  denote  the  elements  of 
Sq  that  lie  within  C.  Each  active  cell  C  stores  the  elements  of  S(C)  in  a  list  and  the  minimum 
axis-aligned  hyperrectangle  that  contains  S  (C). 

Initial  cell.  The  initial  cell  can  be  specified  by  the  user  or  generated  automatically  by  the  algo¬ 
rithm  from  the  sampled  elemental  fits.  In  the  latter  case,  the  algorithm  generates  the  initial  cell 
by  computing  an  axis-aligned  hyperrectangle  in  that  encloses  some  or  all  of  the  elements 
of  Sq.  Because  any  hyperrectangle  enclosing  Sq  will  be  heavily  influenced  by  outliers,  the  algo¬ 
rithm  provides  an  option  to  compute  a  subrectangle  that  contains  a  user-specified  fraction  of  the 
points  of  5o. 

This  subrectangle  is  computed  as  follows.  Let  m  denote  the  number  of  points  of  Sq,  and 
let  m'  <  m  denote  the  user-specified  number  of  points  to  lie  within  the  subrectangle.  Define 
p  —  (m' For  i  =  1, ...  ,c/  -  1,  project  the  points  of  5,_i  onto  the  /th  coordinate  axis 
and  compute  the  smallest  interval  that  contains  a  fraction  of  p  of  these  projected  points.  (This 
can  be  done  easily  in  0(mlogm)  time  by  sorting  and  sweeping  through  the  projected  points.) 
Remove  all  the  points  from  5,_i  that  lie  outside  of  this  interval  and  increment  i.  After  doing 
this  for  all  values  of  i,  return  the  smallest  axis-aligned  rectangle  that  contains  the  final  set  Sd-i. 
Since  each  iteration  keeps  a  fraction p  the  elements  of  5,_i,  it  follows  that  |5,j  is  roughly  mp‘.  On 
completion,  the  number  of  points  that  remain  is  mp‘^^^  =  m',  as  desired.  (Some  additional  care 
is  needed  when  rounding  reals  to  integers  to  obtain  exactly  m'  points.) 


Figure  2:  Computing  the  initial  cell  when  d  =  3,  m  =  20,  and  m'  =  5,  and  p  =  (5/20)*^^  =1/2.  Each  point  represents  a 
randomly  sampled  elemental  fit,  and  each  iteration  computes  the  smallest  interval  containing  a  fraction  p  of  the  remaining 
points. 


In  our  experiments  (see  Section  5)  we  found  that  the  value  m'  -  cmihjnY  worked  well, 
where  c  >  1  is  a  small  constant.  The  rationale  behind  this  choice  is  that,  since  each  elemental 
fit  is  based  on  a  sample  of  d  points,  and  each  point  has  a  probability  of  roughly  hjn  of  being 
an  inlier,  a  fraction  of  roughly  {hlrif  of  the  m  elemental  fits  will  be  based  entirely  on  inkers. 
Ideally,  the  slope  coefficients  of  these  fits  will  be  well  clustered,  and  hence  for  a  suitable  constant 
c  >  1,  the  smallest  hyperrectangle  containing  cm(hfnY  sampled  slopes  is  a  good  candidate  to 
include  the  LTS  optimum. 

Cell  subdivision.  We  will  discuss  how  active  cells  are  selected  for  processing  below.  Once  a 
cell  C  is  selected,  it  is  subdivided  into  two  hyperrectangles  by  an  axis  parallel  hyperplane.  This 
hyperplane  is  chosen  as  follows.  First,  if  S  (C)  is  not  empty,  then  the  algorithm  determines  the 
longest  side  length  of  the  bounding  rectangle  containing  S  (C)  (with  ties  broken  arbitrarily).  It 
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then  sorts  the  vectors  of  S  (C)  along  the  axis  that  is  parallel  to  this  side,  and  splits  the  set  by  a  hy¬ 
perplane  that  is  orthogonal  to  this  side  and  passes  through  the  median  of  the  sorted  sequence  (see 
Fig.  3(a)).  If  S  (C)  is  empty,  the  algorithm  bisects  the  cell  through  its  midpoint  by  a  hyperplane 
that  is  orthogonal  to  the  cell’s  longest  side  length  (see  Fig.  3(b)).  The  resulting  cells,  denoted  Co 
and  Cl  in  the  figure,  are  called  C’s  children. 


B{C) 

• 

*  • 

•  * 

• 

• 

Co 

C 


Co 


Cl 


c 


(b) 


Figure  3:  Subdividing  a  cell  (a)  when  S  (C)  is  nonempty  and  (b)  when  S  (C)  is  empty. 


An  example  of  the  subdivision  process  is  shown  below  in  Fig.  4  for  a  2-dimensional  solution 
space.  In  (a)  we  show  a  cell  C  and  its  samples  S  (C).  In  (b)  we  show  the  subdivision  that  results 
by  repeatedly  splitting  each  cell  through  its  median  sample.  In  (c)  we  show  the  effect  of  two 
additional  steps  of  subdivision  by  bisecting  the  cell’s  longest  side.  (This  assumes  that  all  the 
cells  are  selected  for  processing  and  none  are  killed.)  Note  that  the  subdivision  has  the  desirable 
property  that  it  tends  to  produce  more  cells  in  regions  of  space  where  the  density  of  samples  is 
higher. 


I 

I 

X-, 

1 

r 

X 

(b) 


(c) 


Figure  4:  Example  of  the  subdivision  process:  (a)  initial  cell  and  samples,  (b)  sample-based  subdivision,  (c)  two  more 
stages  of  midpoint  subdivision. 


Upper  and  lower  bounds.  Recall  that  the  upper  bound  for  a  cell  C,  denoted  A^(C),  is  an  upper 
bound  on  the  smallest  LTS  cost  among  all  hyperplanes  whose  slope  coefficients  lie  within  C.  To 
compute  such  an  upper  bound,  it  suffices  to  sample  any  representative  slope  from  within  the  cell 
and  then  compute  the  y-intercept  of  the  hyperplane  with  this  slope  that  minimizes  the  LTS  cost. 
We  shall  show  below  that  it  suffices  to  use  the  reduced  coverage  value  h~  defined  earlier. 

Our  algorithm  chooses  the  representative  slope  as  follows.  If  S  (C)  is  nonempty,  we  take  the 
median  sample  point  that  was  chosen  in  the  cell-subdivision  process.  Otherwise,  we  take  the 
cell’s  midpoint.  As  mentioned  above,  this  provides  only  the  slope  coefficients.  In  Section  4,  we 
will  show  how  to  compute  the  optimum  intercept  value  in  0(n  log  n)  time.  To  further  improve 


11 


the  upper  bound,  we  then  apply  a  user-specified  number  of  C-steps.  (We  used  two  C-steps  in  our 
experiments.) 

The  computation  of  the  lower  bound  for  the  cell  C,  denoted  A  (C),  is  more  complex  than 
the  upper  bound.  Our  approach  is  similar  to  the  upper  bound  computation,  but  rather  than  com¬ 
puting  the  optimum  intercept  for  a  specific  slope,  we  compute  the  optimum  intercept  under  the 
assumption  that  the  slope  could  be  any  point  of  C  (using  the  full  coverage  value  h,  not  h~).  In 
Section  4,  we  will  show  how  to  do  this  in  0(n  log  n)  time  by  solving  a  problem  that  we  call  the 
interval  LTS  problem. 


Killing  cells.  Recall  that  our  objective  is  to  report  a  hyperplane  p  that  satisfies 


^p{P,h-)  <  (l+e,)A*(P,/i). 

Let  p'  denote  the  coefficient  vector  associated  with  the  smallest  value  of  Ap'{P,lr)  encountered 
so  far  in  the  search.  (Note  that  this  cost  is  computed  using  the  reduced  coverage  value  IK )  We 
assert  that  a  cell  C  can  be  killed  if 


A-(C)  > 


Ap<P,h-) 

1  -h  £f 


To  justify  this  assertion,  observe  that  any  coefficient  vector  p  lying  within  C  satisfies 


(2) 


{\  +  Sr)Ap{P,h)  >  (l-Her)A“(C)  >  Ap’(Ph-). 

Thus,  the  LTS  cost  of  p  is  sufficiently  high  that  it  cannot  contradict  the  hypothesis  that  p'  is  a 
valid  hybrid  approximation.  Therefore,  it  is  safe  to  eliminate  C  from  further  consideration.  (Note 
that  future  iterations  might  change p' ,  but  only  in  a  manner  than  would  decrease  Ap'(P,  hr).  Thus, 
the  decision  to  kill  a  cell  never  needs  to  be  reconsidered.) 

It  is  not  hard  to  show  that  whenever  the  ratio  between  a  cell’s  upper  bound  and  lower  bound 
becomes  smaller  than  (1  -H  Br),  the  cell  will  be  killed.  This  is  because  P'  was  selected  from  the 
processed  cell  having  the  smallest  value  of  A^,  and  so  such  a  cell  satisfies 


A-(C)  > 


A^(C) 

1  +  £f 


> 


ApiP,h-) 

1  -h  Bf 


This  implies  that  C  satisfies  Eq.  (2)  and  will  be  killed.  It  follows  that  the  algorithm  terminates 
once  all  active  cells  have  been  subdivided  to  such  a  small  size  that  this  condition  holds. 


Cell  processing  order.  An  important  element  in  the  design  of  an  efficient  geometric  branch- 
and-bound  algorithm  is  the  order  in  which  active  cells  are  selected  for  processing.  When  a 
cell  is  selected  for  processing,  there  are  two  desirable  outcomes  we  might  hope  for.  First,  the 
representative  hyperplane  from  this  cell  will  have  a  lower  LTS  cost  than  the  best  hyperplane  p' 
seen  so  far.  This  will  bring  us  closer  to  the  optimum  LTS  cost,  and  it  has  the  benefit  that  future 
cells  are  more  likely  to  be  killed  by  the  condition  of  Eq.  (2).  Second,  the  lower  bound  of  this  cell 
will  be  so  high  that  the  cell  will  be  killed,  thus  reducing  the  number  of  active  cells  that  need  to 
be  considered.  This  raises  the  question  of  whether  there  are  easily  computable  properties  of  the 
active  cells  that  are  highly  correlated  with  either  of  these  desired  outcomes.  We  considered  the 
following  potential  criteria  for  selecting  the  next  cell: 

(1)  Max-samples:  Select  the  active  cell  that  contains  the  largest  number  of  randomly  sampled 
elemental  fits,  that  is,  the  cell  C  that  maximizes  \S  (C)|. 
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(2)  Min-lower-bound:  Select  the  active  cell  having  the  smallest  lower  bound,  A“(C). 

(3)  Min-upper-bound:  Select  the  active  cell  having  the  smallest  upper  bound,  A^(C). 

(4)  Oldest-cell:  Select  the  active  cell  that  has  been  waiting  the  longest  to  be  processed. 

Based  on  our  experience,  none  of  these  criteria  alone  is  the  best  choice  throughout  the  entire 
search.  At  the  start  of  the  algorithm,  when  many  cells  have  a  large  number  of  samples,  crite¬ 
rion  (1)  tends  to  work  well  because  it  favors  cells  that  are  representative  of  typical  elemental 
fits.  But  as  the  algorithm  proceeds,  and  cells  are  further  subdivided,  the  number  of  samples  per 
cell  decreases.  In  such  cases,  this  criteria  cannot  distinguish  one  cell  from  another.  Criteria  (2) 
and  (3)  are  generally  good  in  these  middle  phases  of  the  algorithm.  But  when  these  two  crite¬ 
ria  are  used  exclusively,  cells  that  are  far  from  the  LTS  optimum  are  never  processed.  Adding 
criterion  (4)  helps  remedy  this  unbalance. 

Our  approach  to  selecting  the  next  active  cell  employs  an  adaptive  strategy.  The  algorithm 
assigns  a  weight  to  each  of  the  four  criteria.  Initially,  all  the  criteria  are  given  the  same  weight.  At 
the  start  of  each  stage,  the  algorithm  selects  the  next  criteria  randomly  based  on  these  weights. 
That  is,  if  the  current  weights  are  denoted  w\,. . .  ,W4,  criterion  i  is  selected  with  probability 

w,7(wi  H - 1-W4).  We  select  the  next  cell  using  this  criterion.  After  processing  this  cell,  we  either 

reward  or  penalize  this  criterion  depending  on  the  outcome.  First,  we  compute  the  ratio  between 
the  cell’s  new  lower  bound  and  the  best  LTS  cost  seen  so  far,  that  is,  =  A^(C)IAp>(P,h~). 
Observe  that  is  nonnegative  and  increases  as  the  lower  bound  approaches  the  current  best 
LTS  cost.  With  probability  min(l,p  )  we  reward  this  choice  by  increasing  the  current  criterion’s 
weight  by  50%,  and  otherwise  we  penalize  it  by  decreasing  it  by  10%.  Second,  we  compute  the 
inverse  of  the  ratio  between  the  cell’s  new  upper  bound  and  the  best  LTS  cost  seen  so  far,  that  is, 

-  Ap’{P,h^)l ^{C).  Again,  p^  is  nonnegative  and  increases  as  the  upper  bound  approaches 
the  best  LTS  cost.  With  probability  min(l,p^)  we  reward  this  choice  by  increasing  the  current 
criterion’s  weight  by  50%,  and  otherwise  we  decrease  it  by  10%. 

The  complete  algorithm.  The  algorithm  begins  by  generating  the  initial  sample  5  0  of  random 
elemental  fits,  and  the  initial  cell  Co  is  computed  as  described  above  and  is  labeled  as  active. 
We  then  repeat  the  following  stages  until  no  active  cells  remain.  First,  the  above  cell-selection 
strategy  is  applied  to  select  and  remove  the  next  active  cell  C.  By  the  subdivision  process,  we 
split  this  cell  into  two  children  cells  Co  and  Ci.  The  set  of  samples  S(C)  is  then  partitioned 
among  these  two  cells  as  5  (Co)  and  5(Ci),  respectively.  Next,  for  each  of  these  children,  the 
lower  and  upper  bounds  are  computed  as  described  earlier.  If  the  upper  bound  cost  is  smaller 
than  the  current  best  LTS  cost,  then  we  update  the  current  best  LTS  fit  ji'  and  its  associated  cost. 
For  each  child  cell,  we  test  whether  it  can  be  killed  by  applying  the  condition  of  Eq.  (2).  If  the 
cell  is  not  killed,  it  is  added  to  the  set  of  active  cells.  In  either  case,  the  selection  criteria  that 
was  used  to  determine  this  cell  is  rewarded  or  penalized,  as  described  above.  The  algorithm 
terminates  when  no  active  cells  remain.  On  termination,  we  output  the  current  best  fit  p  and  its 
associated  cost. 

By  the  remarks  made  earlier,  any  cell  that  is  killed  cannot  contradict  the  hypothesis  that  fi'  is 
an  approximation  to  the  optimal  solution  (subject  to  the  assumption  that  the  optimum  lies  within 
the  initial  slope  bounds).  The  algorithm’s  correctness  follows  because  on  termination,  when  all 
the  active  cells  have  been  killed,  is  a  valid  (e^,  e,j)-hybrid  approximation. 

Unfortunately,  it  is  not  easy  to  provide  a  good  asymptotic  bound  on  the  algorithm’s  overall 
running  time.  Unlike  Fast-LTS,  which  runs  for  a  fixed  number  of  iterations  irrespective  of  the 
structure  of  the  data  set,  the  branch-and-bound  algorithm  adapts  its  behavior  in  accordance  with 
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the  structural  properties  of  the  point  set  and  the  desired  accuracy  of  the  final  result.  In  Section  5 
we  provide  evidence  that  the  algorithm  terminates  very  rapidly  for  well-conditioned  inputs,  but 
it  can  take  significantly  longer  when  the  data  are  poorly  conditioned. 

Even  though  we  cannot  bound  the  total  number  of  stages,  we  can  bound  the  running  time  of 
each  stage.  In  the  next  section  we  will  show  that  the  upper  and  lower  bounds,  A^(C)  and  A  (C), 
can  each  be  computed  in  0(n  log  n)  time,  where  n  is  the  number  of  points.  Otherwise,  the  most 
time  consuming  part  of  each  stage  is  the  time  needed  to  compute  the  next  active  cell.  If  m  denotes 
the  maximum  number  of  active  cells  at  any  time,  and  we  simply  store  them  in  a  list,  the  running 
time  of  each  stage  is  0(m  +  nlogn).  Through  the  use  of  more  sophisticated  data  structures,^  it  is 
possible  to  reduce  the  running  time  to  0{\og  m  +  n  log  n).  Since  we  expect  m  to  be  smaller  than 
n,  we  did  not  bother  to  implement  this. 

4.  Computing  Upper  and  Lower  Bounds 

In  this  section  we  describe  how  to  compute  the  upper  and  lower  bounds  for  a  given  cell  C 
in  the  geometric  branch-and-bound  algorithm  described  in  the  previous  section.  Recall  that  P 
and  h  denote  the  point  set  and  coverage,  respectively,  and  that  each  hyperplane  is  represented 
by  its  slope  coefficients  (J3i, . . .  ,l3d-i)-  C  is  represented  by  a  pair  of  vectors  P  e  and 

consists  of  the  Cartesian  product  of  intervals  ,P^],  which  we  denote  by  \J3  Let 

Ac(E,  h)  -  minyjec  Ay3(P,  h)  denote  the  smallest  LTS  cost  of  any  point  of  C.  Our  objective  is  to 
compute  upper  and  lower  bounds  on  this  quantity. 

In  the  previous  section  (see  “Upper  and  lower  bounds”)  we  explained  how  the  algorithm 
samples  a  representative  vector  of  slope  coefficients  from  C.  To  compute  the  cell’s  upper  bound, 
A^(C),  we  compute  the  value  of  the  y-intercept  coefficient  that,  when  combined  with  the  repre¬ 
sentative  vector  of  slope  coefficients,  produces  the  hyperplane  that  minimizes  the  LTS  cost.  It  is 
straightforward  to  show  that  this  can  be  reduced  to  solving  a  1 -dimensional  LTS  problem,  which 
can  then  be  solved  in  0{n  log  n)  time  [1].  We  will  present  the  complete  algorithm,  since  it  will  be 
illustrative  when  we  generalize  this  to  the  more  complex  problem  of  computing  the  cell’s  lower 
bound. 

Recall  from  Section  3  that,  when  computing  A^(C),  we  use  the  reduced  coverage  value  h  . 
To  simplify  notation,  we  will  refer  to  this  as  h  throughout  this  section.  Let  p,  =  (Xi,yi)  denote 
the  ith  point  of  P,  where  Xi  e  and  y,  e  M.  The  squared  residual  of  p,  with  respect  to  a 
hyperplane  p  —  {Pi,. . .  ,Pd),  which  we  denote  by  r^(P),  is 

rf(P)  =  ()'.■- (3) 

Let  bi  -  ji  -  Xj=i  PjXij-  We  can  visualize  h,  as  the  intersection  of  the  y-axis  and  the  hyperplane 
passing  through  p,  whose  slope  coefficients  match  p  (see  Lig.  5(a)).  Let  B  -  {by, ,  bn]  denote 
this  set  of  intercepts.  Lor  any  hyperplane  P,  the  squared  distance  (h,  -  Pdf'  between  intercepts 
is  clearly  equal  to  r^(j8).  Therefore,  the  desired  intercept  Pd  of  the  LTS  hyperplane  whose  slope 
coefficients  match  )8’s  is  the  point  on  the  y-axis  that  minimizes  the  sum  of  squared  distances  to  its 
h  closest  points  in  B.  That  is,  computing  Pd  reduces  to  solving  the  ( 1 -dimensional)  LTS  problem 
on  the  n-element  set  B. 


^Four  priority  queues  are  maintained,  one  for  each  of  the  cell-selection  criteria.  On  creation,  each  cell  is  inserted  into 
all  four  queues,  and  whenever  a  cell  is  selected  for  processing,  it  is  removed  from  all  four  queues.  By  storing  each  priority 
queue  as  a  balanced  binary  search  tree  [27]  sorted  according  to  the  selection  criterion  together  with  cross-reference  links, 
it  is  possible  to  perform  each  operation  in  0(log  m)  time. 
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Figure  5:  Computing  the  optimum  y-intercept  for  a  given  slope  fi'.  (a)  projecting  the  points  and  (b)  the  subset 
{bi, ....  bi^h-\ I- 


Before  presenting  the  algorithm  for  the  1 -dimensional  LTS  problem,  we  begin  with  a  few 
observations.  First,  let  us  assume  that  the  points  of  B  are  sorted  in  nondecreasing  order  and  have 
been  renumbered  so  that  bi  <  ■  ■  ■  <  b„.  Clearly,  the  h-element  subset  B  that  minimizes  the 
sum  of  squared  distances  to  any  point  on  the  y-axis  consists  of  h  consecutive  points  of  B.  For 
1  <  i  <  n  -  h  +  1,  define  B,  =  {bi, . . .  ,bi+h-\]  (see  Fig.  5(b)).  The  desired  upper  bound  is  just 
the  minimum  among  these  n  -  h  +  1  least-squares  costs.  This  can  be  computed  by  brute  force  in 
0{hn)  -  OirP')  time,  but  we  will  present  an  0(n)  time  incremental  algorithm. 

The  point  that  minimizes  the  sum  of  squared  deviates  within  each  subset  B,  is  just  the  mean 
of  the  subset,  which  we  denote  by  The  least-squares  cost  is  the  subset’s  standard  deviation, 
which  we  denote  by  cr,-.  Define  sum(0  =  *  bk  and  ssq(/)  =  Yj‘k=P^  Given  these,  it  is  well 

known  that  we  can  compute  the  mean  and  standard  deviation  as 

SLM(()  ,  /sSQ(0-(SIJM(()V/Or’ 

'  -ir  ^ 

Thus,  it  sufhces  to  show  how  to  compute  sum(/)  and  ssq(0,  for  l</<n-hH-l.  We  begin 
by  computing  the  values  of  sum(1)  and  ssq(1)  in  0{h)  time  by  brute  force.  As  we  increment  the 
value  of  i,  we  can  update  the  quantities  of  interest  in  (9(1)  time  each.  In  particular, 

sum(/  -I-  1)  <—  sum(0  -h  {bi+h  -  bi)  and  ssQ(i  -H  1)  <—  ssq(0  +  -  bf^  . 

Thus,  given  the  initial  0{n  log  n)  time  to  sort  the  points  and  the  0{h)  time  to  compute  sum(1)  and 
ssq(1),  the  remainder  of  the  computation  runs  in  constant  time  for  each  i.  Therefore,  the  overall 
running  time  is  0{n  log  n  +  h  +  {n  -  h  +  \))  -  0(n  log  n).  The  final  LTS  cost  is  just  the  minimum 
of  the  cr,’s. 

Next,  we  will  show  how  to  generalize  this  to  the  more  complex  task  of  computing  the  lower 
bound,  A  (C).  Recall  that  the  objective  is  to  compute  a  lower  bound  on  the  LTS  cost  of  any  hyper¬ 
plane  whose  slope  coefficients  lie  within  the  {d  -  l)-dimensional  hyperrectangle  [)8  ,j8^].  We 
will  demonstrate  that  this  can  be  reduced  to  the  problem  of  solving  a  variant  of  the  1 -dimensional 
LTS  problem  over  a  collection  of  n  intervals,  called  the  interval  LTS  problem,  and  we  will  present 
an  0(n  log  n)  time  algorithm  for  this  problem. 

Let  (Xi,yi)  denote  the  coordinates  of  p,  e  P,  where  Xi  e  andy,  e  K.  For  any  hyperplane 
P,  recall  the  formula  of  Eq.  (3)  for  the  squared  residual  r^{P).  We  seek  the  value  of  fid  that 
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minimizes  the  sum  of  the  h  smallest  squared  residuals  subject  to  the  constraint  that  j8’s  slope 
coefficients  lie  within  \Ji  Although  we  do  not  know  the  exact  values  of  the  these  slope 

coefficients,  we  do  have  bounds  on  them.  Using  the  fact  that  <  Pj  <  ySt,  for  1  <  y  <  c/  -  1,  we 
can  derive  upper  and  lower  bounds  on  the  second  term  in  the  squared  residual  formula  (Eq.  (3)) 
by  a  straightforward  application  of  interval  arithmetic  [28].  In  particular,  for  !</<«, 


bt 

b- 


d-\ 

7=1 

d-\ 

>1 


where/?}  =  |  yji 
where  fi’’  “  | 


if  Xij  >  0, 
if  Xij  <  0. 

if  Xij  >  0, 
if  Xij  <  0. 


(5) 

(6) 


These  quantities  are  analogous  to  the  values  h,  used  in  the  upper  bound,  subject  to  the  uncertainty 
in  the  slope  coefficients.  It  is  easy  to  verify  that  <  bf,  and  for  any  whose  slope  coefficients 
lie  within  [fi  the  value  of  d(/3)  is  of  the  form  (J3d-bif',  for  some  h,  e  [bj,  bf].  Analogous  to 

the  case  of  the  upper  bound  computation,  this  is  equivalent  to  saying  that  any  hyperplane  passing 
through  Pi  along  the  direction  of  jS  intersects  the  y-axis  at  a  point  in  the  interval  [bj,bf]  (see 
Fig.  6).  For  the  purposes  of  obtaining  a  lower  bound  on  the  FTS  cost,  we  may  take  h,  to  be  the 
closest  point  in  [b~,bf  ]  to  the  hypothesized  intercept  jSj.  Thus,  we  have  reduced  our  lower  bound 
problem  to  computing  the  value  of  fid  that  minimizes  the  sum  of  the  smallest  h  squared  distances 
to  a  collection  of  n  intervals.  This  is  a  generalization  of  the  1 -dimensional  FTS  problem,  but 
where  instead  of  points,  we  now  have  intervals. 


Figure  6:  Reducing  the  lower  bound  to  the  interval  LTS  problem. 


More  formally,  we  define  the  interval  LTS  problem  as  follows.  Given  a  point  b  and  a  closed 
interval,  define  the  distance  from  b  to  this  interval  to  be  the  minimum  distance  from  b  to  any 
point  of  the  interval.  (The  distance  is  zero  if  b  is  contained  within  the  interval.)  Given  a  set  B 
of  n  nonempty,  closed  intervals  on  the  real  line  and  a  coverage  value  h,  for  any  real  b,  define 
r[i]{b,  E)  to  be  the  distance  from  b  to  the  /th  closest  interval  of  B.  Also,  define 


^h{B,h) 


1 

h-l 
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The  interval  FTS  problem  is  that  of  computing  the  minimum  value  of  Ai,(B,  h),  over  all  reals  b, 
which  we  denote  by  A(B,h).  In  summary,  we  obtain  the  following  reduction. 

Lemma  3.  Consider  an  n-element  point  set  P  in  a  coverage  h,  and  a  cell  C  defined  by  the 
slope  bounds  p  e  Let  B  denote  the  set  of  n  intervals  [bf  ,bf]  defined  in  Eqs.  (5)  and  (6) 

above.  Then  for  any  jS  e  C,  A(B,  h)  <  Ap(P,  h). 
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It  suffices  to  show  how  to  solve  the  interval  LTS  problem.  Our  algorithm  is  a  generalization 
of  the  algorithm  for  the  upper-bound  processing.  First,  we  will  identify  a  collection  of  0(n) 
subsets  of  intervals  of  size  h,  such  that  the  one  with  the  lowest  cost  will  be  the  solution  to  the 
interval  LTS  problem.  Second,  we  will  show  how  to  compute  the  LTS  costs  of  these  subsets 
efficiently  through  an  incremental  approach.  To  avoid  discussion  of  messy  degenerate  cases,  we 
will  make  the  general-position  assumption  that  the  interval  endpoints  of  B  are  distinct.  This  can 
always  be  achieved  through  an  infinitesimal  perturbation  of  the  endpoint  coordinates. 

Recall  that  in  the  upper-bound  algorithm,  we  computed  the  LTS  cost  of  each  h  consecutive 
points.  In  order  to  generalize  this  concept  to  the  interval  case,  we  first  sort  the  left  endpoints  B  in 
nondecreasing  order,  letting  denote  the  jth  smallest  left  endpoint.  Similarly,  we  sort  the  right 
endpoints,  letting  denote  the  /th  smallest  right  endpoint.  Define  /,  to  be  the  (possibly  empty) 
interval  define  to  be  the  (possibly  empty)  subset  of  intervals  of  B  whose 

right  endpoints  are  at  least  b^^y  and  whose  left  endpoints  are  at  most  To  avoid  future 

ambiguity  with  the  term  “interval,”  we  will  refer  to  any  interval  formed  from  the  endpoints  of  B 
as  a  span,  and  each  interval  /,  is  called  an  h-span  (see  Fig.  7).  We  next  present  a  technical  result 
about  the  sets  Bi.  The  proof  is  given  in  Appendix  A. 
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Figure  7:  For  h  =  A,  the  /i-spans  /,.  The  intervals  of  the  subset  B3  corresponding  to  B  =  [^[3]>  ^[6]]  highlighted. 

Lemma  4.  Let  (B,  h)  be  an  instance  of  the  interval  LTS  problem,  and  let  Bi  be  the  subsets  of  B 
defined  above,  for  1  <  i  <  n  —  h  +  1.  If  for  any  i,  b^-^  >  then  A(B,  h)  —  0.  Otherwise,  each 

set  Bi  consists  of  exactly  h  intervals  of  B. 

By  this  lemma  we  may  assume  henceforth  that  b^-^  >  by^y^  ^^  for  all  i,  since  if  we  ever 
detect  that  this  is  not  the  case,  we  may  immediately  report  that  A(B,  h)  is  zero  and  terminate 
the  algorithm.  Intuitively,  the  subsets  Bi  are  chosen  to  be  the  “tightest”  subsets  of  B  having  h 
elements.  Our  next  lemma  shows  that  for  the  purposes  of  solving  the  interval  LTS  problem,  it 
suffices  to  consider  just  these  subsets.  For  1  <  i  <  n  -  h  +  1,  define  A(B,  ,  h)  to  be  the  LTS  cost  of 
the  interval  LTS  problem  on  only  the  h  intervals  of  Bi. 

Lemma  5.  Given  an  instance  (B,h)  of  the  interval  LTS  problem,  A(B,h)  —  A{Bi,h)  for  some  i, 
where  \  <i<n  —  h+\. 

The  proof  is  given  in  Appendix  A.  It  follows  that  in  order  to  solve  the  interval  LTS  problem, 
it  suffices  to  compute  the  LTS  cost  of  each  of  the  subsets  B,,  for  1  <i<n-h+l.  The  algorithm 
is  a  generalization  of  the  upper  bound  algorithm,  where  intervals  play  the  role  of  individual 
points.  The  details  of  the  algorithm  are  relatively  technical  but  straightforward.  A  complete 
presentation  of  the  algorithm  and  its  running  time  analysis  are  presented  in  Appendix  B.  The 
following  theorem  summarizes  the  main  result  of  this  section. 

Theorem  2.  Given  a  collection  B  of  n  intervals  on  the  real  line  and  a  coverage  h,  it  is  possible 
to  solve  the  interval  LTS  problem  on  B  in  0(n  log  n)  time. 
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5.  Empirical  Analysis 


In  this  section  we  present  a  number  of  experiments  comparing  the  accuracy  and  running 
time  of  Adaptive-LTS  and  Fast-LTS  on  various  input  distributions.  We  consider  input  sets  from 
various  distributions,  in  various  dimensions,  and  with  various  degrees  of  numerical  conditioning. 

Before  presenting  our  results,  let  us  remark  on  the  structure  that  is  common  to  all  our  exper¬ 
iments.  We  implemented  both  Rousseeuw  and  van  Driessen’s  Fast-LTS  and  our  Adaptive-LTS 
in  C-n-  (compiled  by  g++  4.5.2  with  optimization  level  “-03”  and  running  on  a  dual-core  Intel 
Pentium-D  processor  under  Ubuntu  Linux  1 1.04).  In  all  experiments  we  let  the  algorithm  choose 
the  initial  cell  bounds  using  the  approach  described  under  “Initial  Cell”  in  Section  3. 

Both  Fast-LTS  and  Adaptive-LTS  operate  in  a  series  of  stages.  Each  stage  of  our  implemen¬ 
tation  of  Fast-LTS  consists  of  a  single  random  elemental  fit  followed  by  two  C-steps.  (This  is 
the  same  number  used  by  Rousseeuw  and  van  Driessen  in  [15].)  Although  Adaptive-LTS  has 
a  termination  condition,  for  the  sake  of  comparison  with  Fast-LTS  we  disabled  the  termination 
condition  in  most  of  our  experiments,  and  instead  ran  the  two  algorithms  for  the  same  fixed  num¬ 
ber  of  stages  and  compared  their  performance  on  a  stage-by-stage  basis.  For  the  same  reason, 
we  omitted  the  final  post-processing  phase  of  Fast-LTS. 

5.7.  Convergence  Rates  for  Synthetically  Generated  Data 

For  these  experiments,  we  generated  a  data  set  consisting  of  1000  points  in  for  d  e  (2, 3). 
In  order  to  simulate  point  sets  that  contain  outliers,  each  data  set  consisted  of  the  union  of  points 
drawn  from  two  distributions.  A  fixed  fraction  of  points  were  sampled  independently  from  an  in- 
lier  distribution,  which  consists  of  points  that  lie  close  to  a  randomly  generated  hyperplane.  The 
remaining  points  were  sampled  from  some  different  distribution,  called  the  outlier  distribution. 
These  distributions  were  varied  to  explore  different  numbers  of  dimensions  and  different  degrees 
of  numerical  conditioning. 


(b) 

Figure  8:  The  distributions  hyp+uniformi  and  hyp+half-unifi  . 


Hyp+Half-Unifi 


Hyp-fUnifi 


HYP-HUNiFORM  (2-dimensional):  Our  first  experiment  involved  a  2-dimensional,  well-conditioned 
distribution,  consisting  of  1000  points  (one  independent  and  one  dependent  variable).  First,  a  line 
of  the  form  y  -  fix  +  j32  was  generated,  where  fix  was  sampled  uniformly  from  [-0.25,  -1-0.25], 
and /32  was  sampled  uniformly  from  [-0.1,  -i-O.l].  Next,  550  points  (55%  inkers)  were  generated 
with  the  x-coordinates  sampled  uniformly  from  [-l,-i-l],  and  each  y-coordinate  generated  by 
computing  the  value  of  the  line  at  these  x-coordinates  and  adding  a  Gaussian  deviate  with  mean 
0  and  standard  deviation  0.01.  The  remaining  450  points  (45%  outliers)  were  sampled  uniformly 
from  the  square  [-1,  H-1]^.  Because  the  inkers  were  drawn  from  a  hyperplane  (actually  a  line 
in  this  case)  and  the  outliers  were  drawn  from  a  uniform  distribution,  we  call  this  distribution 
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HYP+UNiFORM  OF  generally  hyp+uniform^  to  denote  the  variant  with  k  independent  variables  (see 

Fig.  8). 

We  ran  both  Fast-LTS  and  Adaptive-LTS  on  the  resulting  data  set  for  500  stages.  Although 
in  most  of  our  experiments,  we  set  the  coverage  value  to  50%,  to  make  this  instance  particularly 
easy  to  solve,  we  ran  both  algorithms  with  a  relatively  low  coverage  of  h  —  0.1  (that  is,  10% 
inliers).  In  Fig.  9(a)  and  (b)  we  show  the  results  for  Fast-LTS  and  Adaptive-LTS,  respectively, 
where  the  horizontal-axis  indicates  the  stage  number  and  the  vertical-axis  indicates  the  LTS 
cost.  Note  that  the  vertical-axis  is  on  a  logarithmic  scale.  Each  (small  square)  point  of  each 
plot  indicates  the  LTS  cost  of  the  sampled  hyperplane.  The  upper  (solid  blue  or  “Best”)  curve 
indicates  the  minimum  LTS  cost  of  any  hyperplane  seen  so  far.  The  lower  (broken  red  or  “Lower 
bound”)  of  Fig.  9(b)  indicates  the  minimum  lower  bound  of  any  active  cell.  The  optimum  LTS 
cost  lies  somewhere  between  these  two  curves,  and  therefore,  the  ratio  between  the  current  best 
and  current  lower  bound  limits  the  maximum  relative  error  that  would  result  by  terminating  the 
algorithm  and  outputting  the  current  best  fit. 


Fast-LTS:  Cost  per  Stage  (Hyp+Uniform) 


Adaptive-LTS:  Cost  per  Stage  (Hyp-rUniform) 
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Figure  9:  LTS  cost  versus  stage  number  for  the  hyp+uniform  distribution  (tf  =  2,  /?  =  0.1)  for  (a):  Fast-LTS  and  (b): 
Adaptive-LTS. 


As  mentioned  earlier,  this  is  a  well-conditioned  distribution,  and  therefore  it  not  surprising 
that  both  algorithms  converge  very  rapidly  to  a  solution  that  is  nearly  optimal.  Given  only  infor¬ 
mation  on  the  best  LTS  cost,  a  user  of  Fast-LTS  would  not  know  when  to  terminate  the  search.  In 
contrast,  a  user  of  Adaptive-LTS  would  know  that  after  just  75  iterations,  the  relative  difference 
between  the  best  upper  and  lower  bounds  is  smaller  than  1%.  Thus,  terminating  the  algorithm 
at  this  point  would  imply  an  approximate  error  of  at  most  1%.  It  is  also  interesting  to  note  that 
Fast-LTS  generates  elemental  fits  at  random,  and  hence  the  pattern  of  plotted  points  remains 
consistently  random  throughout  all  500  stages.  In  contrast,  the  sampled  hyperplanes  generated 
by  Adaptive-LTS  shows  signs  of  converging  to  a  subregion  of  the  configuration  space  where  the 
most  promising  solutions  reside. 

HYP-FUNiFORM  (3-dimensional):  Our  second  experiment  involved  the  same  distribution,  but  in 
(two  independent  and  one  dependent  variable)  and  using  a  larger  value  of  h.  First,  a  plane  of 
the  form  y  =  /3iXi  +  ^2X2  +  Pt,  was  generated,  where  Pi  and  P2  were  sampled  uniformly  from 
[-0.25,  -fO.25],  and  was  sampled  uniformly  from  [-0.1,  +0.1].  As  before,  550  points  (55% 
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inliers)  were  generated  with  the  xi-  and  ;ii:2-coordinates  sampled  uniformly  from  [-1,  +1],  and  the 
y-coordinates  generated  by  computing  the  value  of  the  plane  at  these  ;ic-coordinates  and  adding 
a  3-dimensional  deviate  whose  coordinates  are  independent  Gaussians  of  mean  0  and  standard 
deviation  0.01.  The  remaining  450  points  (45%  outliers)  were  sampled  uniformly  from  the  cube 
[-1,  H-1]^.  We  refer  to  this  as  hyph-uniform2. 

We  ran  both  Fast-LTS  and  Adaptive-LTS  on  the  data  set  for  500  stages  with  h  -  0.5.  In 
Fig.  10(a)  and  (b)  we  show  the  results  for  Fast-LTS  and  Adaptive-LTS,  respectively.  Again,  both 
algorithms  converge  rapidly  to  a  solution  that  is  nearly  optimal.  However,  as  in  the  previous 
experiment,  a  user  of  Adaptive-LTS  could  use  the  ratio  between  the  best  upper  and  lower  bounds 
to  limit  the  maximum  error  in  the  LTS  cost.  For  example,  after  roughly  250  stages  the  Adaptive- 
LTS  relative  error  is  within  20%,  and  after  roughly  400  stages  it  is  within  10%.  Observe  that  the 
rate  of  convergence  is  slower  than  in  the  earlier  2-dimensional  experiment.  We  conjecture  that 
this  is  due  to  the  increase  in  the  dimension,  which  results  in  searching  a  larger  solution  space 
(see  Section  5.2). 


Fast-LTS:  Cost  per  Stage  (Hyp+Uniform) 


Adaptive-LTS:  Cost  per  Stage  (Hyp-rUniform) 
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Figure  10:  LTS  cost  versus  stage  number  for  the  hyp+uniform  distribution  {d  =  3,  h  =  0.5)  for  (a):  Fast-LTS  and  (b): 
Adaptive-LTS. 


hyp-hhalf-unif:  This  involved  a  data  set  consisting  of  1000  points  in  (two  independent  and  one 
dependent  variable),  in  which  the  outlier  distribution  was  more  skewed.  The  inliers  were  gen¬ 
erated  in  the  same  manner  as  in  the  previous  experiment.  The  outliers  were  sampled  uniformly 
from  the  portion  of  the  3-dimensional  hypercube  [-1,  +1]^  that  lies  above  the  plane.  (Fig.  8(b) 
illustrates  this  distribution  in  M^.)  As  before,  we  ran  both  Fast-LTS  and  Adaptive-LTS  on  the 
data  set  for  500  stages  with  h  -  0.5.  In  Fig.  11(a)  and  (b)  we  show  the  results  for  Fast-LTS 
and  Adaptive-LTS,  respectively.  The  convergence  rates  of  the  two  algorithms  are  similar  to  the 
previous  experiment.  For  example,  after  roughly  200  stages  the  relative  error  is  within  20%,  and 
after  roughly  300  stages  it  is  within  10%. 

FLAT H- sphere:  Fot  this  experiment  we  designed  a  rather  pathological  distribution  with  the  inten¬ 
tion  of  challenging  both  algorithms.  The  inlier  distribution  was  chosen  so  that  almost  all  the 
inliers  are  degenerate  (thus  resulting  in  many  unstable  elemental  fits)  and  the  remaining  few 
inliers  are  a  huge  distance  away  (thus  penalizing  any  inaccurate  elemental  fits). 
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Fast-LTS:  Cost  per  Stage  (Hyp+Half-Unif) 


Adaptive-LTS:  Cost  per  Stage  (Hyp+Half-Unif) 
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Figure  11:  LTS  cost  versus  stage  number  for  the  hyp+half-unif  distribution  {d  =  3,  h  =  0.5)  for  (a):  Fast-LTS  and  (b): 
Adaptive-LTS. 


The  point  set  consisted  of  1000  points  in  with  500  inliers  and  500  outliers.  Let  S  be 
a  large  sphere  centered  at  the  origin  with  radius  10,000,  let  /3  he  a  plane  passing  through  the 
origin,  and  let  £  be  a  line  passing  through  the  origin  that  lies  on  ji  (see  Fig.  12).  The  inliers 
were  drawn  from  two  distributions.  First,  490  local  inliers  were  sampled  uniformly  along  a  line 
segment  of  length  two  centered  at  the  origin  and  lying  on  £,  where  each  point  was  perturbed  by 
a  multivariate  Gaussian  with  mean  zero  and  standard  deviation  0.10.  The  remaining  10  inliers, 
called  leverage  inliers,  were  sampled  uniformly  from  the  equatorial  circle  where  p  intersects  S . 
We  set  the  coverage  to  h  —  0.5,  implying  that  exactly  500  points  are  used  in  the  computation  of 
the  LTS  cost.  Finally,  the  outliers  consisted  of  500  points  that  were  sampled  uniformly  from  S . 
We  refer  to  this  distribution  as  flat+sphere2. 


2 


To  see  why  this  distribution  is  particularly  difficult,  observe  that  barring  lucky  coincidences, 
the  only  way  to  obtain  a  low-cost  LTS  fit  is  to  combine  all  the  local  inliers  with  all  the  leverage 
inliers  to  obtain  a  fit  that  lies  very  close  to  p.  Thus,  any  elemental  fit  that  involves  even  one 
outlier  is  unlikely  to  provide  a  good  fit.  Because  of  their  degenerate  arrangement  along  £,  any 
random  sample  of  inliers  that  fails  to  include  at  least  one  leverage  inlier  is  unlikely  to  lie  close  to 
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p.  The  probability  of  sampling  three  inliers  such  that  at  least  one  is  a  leverage  inlier  is  roughly 
(500/1000)^  ■  10/500  =  0.025.  Note  that  by  decreasing  the  fraction  of  leverage  inliers  relative  to 
the  total  number  of  inliers,  we  can  make  this  probability  of  a  useful  elemental  fit  arbitrarily  small. 
This  point  set  clearly  fails  both  of  the  criteria  given  in  Section  2  for  well-conditioned  point  sets, 
since  the  vast  majority  of  inliers  are  degenerately  positioned,  and  there  are  a  nontrivial  number 
of  leverage  inliers. 

Our  experiments  bear  out  the  problems  of  this  poor  numerical  conditioning  (see  Fig.  13). 
Both  algorithms  cast  about  rather  aimlessly  until  hitting  upon  a  good  fit  (around  stage  230  for 
Fast-LTS  and  stage  50  for  Adaptive-LTS).  This  experiment  demonstrates  the  principal  advantage 
of  Adaptive-LTS  over  Fast-LTS.  A  user  of  Fast-LTS  who  saw  only  the  results  of  the  first  200 
iterations  might  be  tempted  to  think  that  the  algorithm  had  converged  to  an  optimal  LTS  cost  of 
roughly  3.8,  rather  than  the  final  cost  of  roughly  0.102.  Terminating  the  algorithm  prior  to  stage 
200  would  have  resulted  in  a  relative  error  of  over  3000%.  In  contrast,  a  user  of  Adaptive-LTS 
would  know  by  stage  200  that  the  relative  error  in  the  final  LTS  cost  is  less  than  30%  and  after 
stage  300  it  is  close  to  10%. 


Fast-LTS:  Cost  per  Stage  (Flat+Sphere) 


Adaptive-LTS:  Cost  per  Stage  (Flat-rSphere) 
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Figure  13:  LTS  cost  versus  stage  number  for  the  f^lat+sphere  distribution  (d  =  3,  h  =  0.5)  for  (a):  Fast-LTS  and  (b): 
Adaptive-LTS. 


5.2.  Execution  Times  and  Dimension  Dependence 

So  far  we  have  analyzed  only  the  LTS  costs  generated  by  the  two  algorithms.  An  important 
question  is  how  the  performance  of  Adaptive-LTS  compares  with  Fast-LTS.  Table  1  summarizes 
the  execution  times  and  LTS  costs  for  both  of  these  algorithms  for  500  stages.  In  all  cases,  both 
algorithms  generated  essentially  the  same  result  and,  hence,  achieved  the  same  LTS  cost.  While 
the  execution  times  of  Adaptive-LTS  were  consistently  higher  than  those  of  Fast-LTS,  it  was  only 
by  a  small  constant  factor. 

Another  issue  is  how  the  algorithms  scale  with  dimension.  Assuming  that  n  »  d,  the  run¬ 
ning  time  of  a  fixed  number  k  of  iterations  of  Fast-LTS  grows  as  O(kdn).  Of  course.  Theorem  1 
predicts  that  the  number  of  iterations  needed  to  achieve  an  accurate  fit  is  expected  to  grow  ex¬ 
ponentially  with  the  dimension.  In  contrast,  the  running  time  of  Adaptive-LTS  is  much  harder 
to  predict.  This  is  because  it  depends  on  the  effectiveness  of  the  branch-and-bound  pruning.  By 
Lemma  2),  the  running  time  per  cell  grows  as  0{n  log  n),  but  it  is  reasonable  to  believe  that  the 
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Table  1:  Summary  of  execution  times  and  LTS  costs  for  the  synthetic  experiments  for  500  iterations  of  each  algorithm. 
Running  times  are  given  in  CPU  seconds. 


Fast-LTS  Adaptive-LTS 


Distribution 

Indep.  Vars. 

d 

h 

Fig. 

Time 

LTS  Cost 

Time 

LTS  Cost 

HYP-HUNIFORMi 

1 

2 

0.1 

9 

0.14 

0.00660 

0.44 

0.00660 

HYP-I-UNIFORM2 

2 

3 

0.5 

10 

0.47 

0.00776 

0.83 

0.00776 

HYP-I-HALF-UNIF2 

2 

3 

0.5 

11 

0.48 

0.00440 

0.73 

0.00440 

FLAT-I-SPHERE2 

2 

3 

0.5 

13 

0.81 

0.10220 

1.11 

0.10220 

number  of  cells  will  grow  exponentially  with  the  dimension  of  the  solution  space,  which  equals 
the  number  of  independent  variables.  An  intuitive  explanation  is  that  the  accuracy  of  a  cell’s 
lower  bound  depends  on  its  diameter,  and  the  number  of  cells  of  a  given  diameter  needed  to 
cover  a  hypercube  of  a  given  side  length  (the  initial  cell)  grows  exponentially  with  the  dimension 
of  the  hypercube. 

We  ran  both  Fast-LTS  and  Adaptive-LTS  on  the  same  1000-point  instances  of  hyp-huniform, 
where  the  number  of  independent  variables  ranged  from  one  to  nine  (that  is,  2  <  <  10).  As  in 

the  earlier  distributions,  55%  of  the  points  (inliers)  had  li-  1  independent  variables  sampled  uni¬ 
formly  from  [-1,  and  the  remaining  dependent  variable  was  generated  to  lie  on  a  random 

{d  -  l)-dimensional  hyperplane  plus  a  Gaussian  error  of  mean  0  and  standard  deviation  0.01. 
The  remaining  45%  of  points  (outliers)  were  generated  uniformly  in  a  rZ-dimensional  hypercube. 
Because  it  has  d  -  \  independent  variables,  we  refer  to  this  distribution  as  hyph-uniform^^i.  In 
each  case  we  requested  a  coverage  of  50%. 

Execution  Time  vs.  Dimension  (Hyp+Unif) 
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Figure  14:  Execution  times  as  a  function  of  (a)  the  number  of  independent  variables  and  (b)  the  coverage  value. 
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We  ran  Fast-LTS  and  Adaptive-LTS  each  for  1000  iterations.  We  also  ran  Adaptive-LTS  to 
its  natural  termination  with  Sy  -  10  and  Sq  -  0.  (That  is,  we  allowed  an  error  factor  of  10  in  the 
LTS  cost,  and  allowed  for  no  reduction  in  the  coverage.)  Fig.  14(a)  shows  the  the  execution  times 
in  CPU  seconds  (plotted  on  a  log  scale).  All  three  converged  to  nearly  the  same  final  fit  and  had 
very  similar  LTS  costs.  As  hypothesized,  the  execution  times  for  the  fixed-iteration  versions  are 
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relatively  insensitive  to  variations  in  the  dimension,  but  the  execution  times  of  full  Adaptive-LTS 
show  a  strong  dependence  on  dimension. 

Next,  we  considered  the  impact  of  varying  the  coverage  parameter  h.  We  ran  the  same 
experiment  as  above  but  with  the  number  of  independent  variables  fixed  at  three  {d  -  4)  and  with 
h  varying  from  10%  to  50%.  Fig.  14(b)  shows  the  the  execution  times  in  CPU  seconds  (plotted 
on  a  log  scale).  The  execution  times  for  the  fixed-iteration  versions  are  relatively  insensitive  to 
the  coverage.  The  execution  time  of  full  Adaptive-LTS  is  strongly  negatively  correlated  with 
h.  This  is  rather  surprising,  since  it  would  seem  that  decreasing  the  coverage  should  make  the 
problem  easier  to  solve.  A  detailed  analysis  reveals  that  the  issue  is  the  cell  lower  bounds  used 
by  the  branch-and-bound  algorithm.  Smaller  coverage  values  resulted  in  smaller  lower  bound 
values.  The  upper  bounds,  while  smaller,  were  not  as  strongly  affected.  As  a  result,  cells  were 
not  killed  as  efficiently,  and  this  resulted  in  larger  running  times. 

5.3.  Empirical  Analysis  of  Numerical  Condition 

The  variation  in  convergence  rates  seen  in  Section  5.1  raises  the  question  how  these  data  sets 
compare  to  one  another  with  respect  to  the  numerical  condition  measures  described  in  Section  2. 
To  evaluate  this,  we  ran  a  series  of  experiments  involving  both  the  low-dimensional  data  sets  from 
Section  5.1  and  two  higher-dimensional  data  sets,  hyp-huniformio  and  hyph-uniform2o.  These 
data  sets  and  the  associated  coverage  values  used  in  the  experiments  are  summarized  in  Table  2. 


Table  2:  Summary  of  data  sets  (inlier  subsets) 


Distribution 

Indep.  Vars. 

Inliers 

Residual  Std.  Dev. 

q 

HYP-HUNIFORMi 

1 

110 

0.01 

10% 

HYP-I-UNIFORM2 

2 

550 

0.01 

50% 

HYP-I-HALF-UNIF2 

2 

550 

0.01 

50% 

HYP-HUNIFORMio 

10 

550 

0.01 

50% 

HYP-I-UNIFORM20 

20 

550 

0.01 

50% 

FLATH-SPHERE2 

2 

500 

0.10 

50% 

For  each  data  set,  we  computed  the  numerical  condition  measures  y  and  A,  as  defined  in 
Eq.(l)  from  Section  2.  These  are  shown  in  Table  3.  In  particular,  the  matrix  condition  mea¬ 
sure  y  is  the  (estimated)  median  condition  number  of  the  set  of  elemental-fit  matrices  Xe.  Its 
value  was  computed  by  randomly  sampling  10,000  elemental  subsets  and  computing  the  median 
(Frobenius)  condition  number  of  the  resulting  matrices.  The  leverage  measure  A  was  computed 
(exactly)  from  Eq.(l)  as  the  square  root  of  the  ratio  of  the  average  and  the  median  of  squared 
norms  of  the  vectors  of  independent  variables. 

We  also  computed  the  number  of  elemental  fits  from  Theorem  1  needed  to  guarantee  (with 
probability  at  least  1  -  <5)  a  fit  that  achieves  a  relative  error  of  at  most  sflyA.  This  is  shown  in 
the  fourth  column  of  Table  3.  Since  Theorem  1  merely  provides  bounds,  we  ran  an  experiment 
to  determine  how  well  random  elemental  fits  perform  for  these  data  sets.  Eor  each  data  set 
(consisting  of  1000  points,  including  both  inkers  and  outliers)  we  generated  one  million  random 
elemental  fits.  In  each  case  we  computed  the  LTS  cost  of  the  resulting  fit.  In  order  to  be  faithful  to 
Theorem  1  we  did  not  apply  any  C-steps.  We  estimated  the  optimum  LTS  cost  based  on  the  noise 
standard  deviation  used  when  generating  the  inliers  (shown  in  the  fourth  column  of  Table  2).  Let 
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Table  3:  Results  of  experiments  on  numerical  condition  for  synthetic  data  sets. 


Distribution 

r 

A 

Elem-Fits 

err<l% 

err<10% 

err<100% 

HYPH-UNIFORMi 

4.08 

1.142 

7369 

0.78% 

0.88% 

1.75% 

HYPH-UNIFORM2 

9.18 

1.027 

1179 

3.80% 

5.02% 

11.20% 

HYPH-HALF-UNIF2 

9.25 

1.018 

1179 

9.18% 

9.83% 

13.39% 

HYPH-UNIFORMio 

70.80 

1.009 

7.7x10^ 

0.01% 

0.03% 

0.04% 

HYPH-UNIFORM20 

195.08 

1.006 

OO 

X 

o 

0.00% 

0.00% 

0.00% 

FLAT-I-SPHERE2 

33.99 

2606 

1179 

0.00% 

0.01% 

0.01% 

A,  denote  the  LTS  cost  of  the  /th  elemental  fit,  and  let  cr  denote  the  this  standard  deviation.  For 
p  e  {0.01,0.1, 1.0),  we  declared  that  this  elemental  fit  achieves  a  relative  error  of  at  most  p%  if 
A,  <  (1  +  p/100)cr.  The  fractions  of  fits  satisfying  these  error  bounds  are  shown  in  the  last  three 
columns  of  Table  3.  Observe,  for  example,  that  for  the  data  set  hyp+uniform2,  roughly  5%  of 
the  elemental  fits  achieved  a  relative  error  of  at  most  10%,  suggesting  that  fits  of  this  accuracy 
occur  randomly  at  a  rate  of  roughly  one  out  of  every  twenty.  These  results  qualitatively  support 
Theorem  1 ,  but  the  observed  errors  were  quite  a  bit  smaller  than  the  bounds  given  by  the  theorem. 

5.4.  DPOSS  Data  Set 

Our  final  experiment  involved  a  point  set  from  the  Digitized  Palomar  Sky  Survey  (DPOSS), 
from  the  California  Institute  of  Technology  [29].  The  data  set  consisted  of  132,402  points  in  M®, 
from  which  we  sampled  10,000  at  random  for  each  of  our  experiments.  Of  the  six  coordinates, 
we  considered  the  problem  of  estimating  the  first  coordinate  MAperF  as  a  function  of  one  or  more 
of  the  coordinates  csfF,  csfj,  csfN.  Fig.  15  shows  a  scatter  plot  of  MAperF  versus  csfF,  and  also 
shows  the  LTS  fit  and  an  ordinary  least  square  (OLS)  fit.  We  used  the  same  experimental  setup 
as  in  the  synthetic  experiments,  by  running  both  Fast-LTS  and  Adaptive-LTS  for  500  stages. 


DPOSS  Scatterplot  (MAperF  vs.  csfF) 


csfF 

Figure  15:  The  DPOSS  data  set  (MAperF  vs.  csfF). 

In  the  first  experiment,  we  considered  a  2-dimensional  instance  of  estimating  MAperF  as  a 
function  of  csfF  (see  Fig.  16).  Both  algorithms  rapidly  converged  to  essentially  the  same  solution 
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(shown  in  Fig.  15).  Because  it  provides  a  lower  bound,  a  user  of  Adaptive-LTS  would  know  after 
just  30  stages  that  the  relative  error  in  the  LTS  cost  of  this  solution  is  less  than  5%. 


Fast-LTS:  Cost  per  Stage 


(a) 


Adaptive-LTS:  Cost  per  Stage 
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Figure  16:  LTS  cost  versus  stage  number  for  estimating  MAperF  as  function  of  csfF  in  the  DPOSS  data  set  {d  =  2)  for 
(a):  Fast-LTS  and  (b):  Adaptive-LTS. 


Our  next  experiment  considered  a  3-dimensional  instance  of  estimating  MAperF  as  a  function 
of  both  csfF  and  csfJ  (see  Fig.  17).  Again,  both  algorithms  rapidly  converged  to  essentially  the 
same  solution.  A  user  of  Adaptive-LTS  would  know  that  after  roughly  250  stages  the  Adaptive- 
LTS  relative  error  is  within  20%,  and  after  roughly  500  stages  it  is  within  10%. 


Fast-LTS:  Cost  per  Stage 


Figure  17:  LTS  cost  versus  stage  number  for  the  DPOSS 


Adaptive-LTS:  Cost  per  Stage 


(b) 

set  {d  =  3)  for  (a):  Fast-LTS  and  (b):  Adaptive-LTS. 


Our  final  experiment  demonstrates  one  of  the  practical  limitations  of  Adaptive-LTS.  We  con¬ 
sidered  a  4-dimensional  instance  of  estimating  MAperF  as  a  function  of  csfF,  csfJ,  and  csfN  (see 

Fig.  18). 

Again,  both  algorithms  rapidly  converged  to  essentially  the  same  solution.  However,  due  to 
the  increase  in  dimension,  the  convergence  in  the  error  bound  is  much  slower.  After  500  stages, 
the  relative  error  bound  for  Adaptive-LTS  is  over  100%.  Although  it  is  not  evident  from  the  plot, 
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Fast-LTS:  Cost  per  Stage 


Figure  18:  LTS  cost  versus  stage  number  for  the  DPOSS 


Adaptive-LTS:  Cost  per  Stage 


(b) 

set  {d  =  4)  for  (a):  Fast-LTS  and  (b):  Adaptive-LTS. 


it  is  only  after  roughly  5,000  stages  that  the  error  bound  falls  below  20%,  and  after  10,000  stages 
it  falls  below  10%. 

As  in  our  synthetic  experiments,  we  compared  the  relative  performance  of  Adaptive-LTS 
and  Fast-LTS.  In  Table  4  we  summarize  the  execution  times  and  LTS  costs  for  both  of  these 
algorithms  for  500  stages.  Both  algorithms  generated  essentially  the  same  fitting  hyperplane  and, 
hence  achieved  the  same  LTS  cost.  Because  of  its  higher  overhead,  Adaptive-LTS  did  require 
more  execution  time  for  the  same  number  of  stages,  but  in  all  instances  the  running  time  was  less 
than  50%  greater  than  Fast-LTS. 


Table  4:  Summaiy  of  execution  times  and  LTS  costs  for  the  DPOSS  experiments  for  500  iterations  of  each  algorithm. 
Running  times  are  given  in  CPU  seconds. 


Fast-LTS 

Adaptive-LTS 

d 

h 

Fig. 

Time 

LTS  Cost 

Time 

LTS  Cost 

2 

0.5 

16 

5.02 

0.0884 

7.31 

0.0884 

3 

0.5 

17 

5.77 

0.0915 

7.40 

0.0915 

4 

0.5 

18 

7.72 

0.1308 

8.54 

0.1311 

6.  Concluding  Remarks 

In  this  paper  we  have  presented  two  results  related  to  the  computation  of  the  linear  least 
trimmed  squares  estimator.  First,  we  introduced  a  measure  of  the  numerical  condition  of  a  point 
set.  We  showed  that  if  a  point  set  is  well  conditioned,  then  it  is  possible  to  bound  the  accuracy 
(as  a  function  of  the  conditioning  parameters)  of  the  LTS  fit  resulting  from  a  sufficiently  large 
number  of  random  elemental  fits.  Second,  we  presented  an  approximation  algorithm  for  LTS, 
called  Adaptive-LTS.  Given  a  point  set,  a  coverage  value,  and  bounds  on  the  maximum  and 
minimum  slope  coefficients,  this  algorithm  computes  a  hybrid  approximation  to  the  optimum 
LTS  ht  assuming  the  slope  coefficients  are  drawn  from  these  bounds.  We  implemented  this 
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algorithm  and  presented  empirical  evidence  on  a  variety  of  data  sets  of  this  algorithm’s  efficiency 
and  effectiveness.  In  contrast  to  existing  practical  approaches,  this  algorithm  computes  a  lower 
bound  on  the  optimum  LTS  cost.  Therefore,  when  the  algorithm  terminates  it  provides  an  upper 
bound  on  the  approximation  error.  The  software  is  freely  available  from  the  first  author’s  web 
page,  http// WWW. cs .umd. edu/ -mount. 

There  are  a  number  of  possible  directions  for  future  research.  First,  our  analysis  of  well- 
conditioned  point  sets  considered  only  the  results  of  random  elemental  fits,  but  it  ignored  the 
effect  of  possible  C-steps.  It  would  be  interesting  to  consider  the  effect  of  the  numeric  condition 
on  the  convergence  rate  of  the  C-steps.  Second,  an  obvious  weakness  of  our  Adaptive-LTS 
algorithm  is  its  reliance  on  the  requirement  that  the  initial  bounds  on  the  slope  coefficients  be 
given.  Although  the  program  can  estimate  these  bounds  from  random  elemental  fits,  we  cannot 
guarantee  that  the  optimum  fit  lies  within  these  bounds.  It  would  be  nice  to  have  the  algorithm 
compute  the  initial  bounds  in  such  a  manner  that  the  optimum  LTS  cost  is  guaranteed  to  lie  within 
them.  Finally,  although  we  showed  that  each  stage  of  the  algorithm  runs  in  0{m  +  n  log  ri)  (where 
n  is  the  number  of  points  and  m  is  the  number  of  active  cells),  we  do  not  have  good  bounds  on 
the  number  of  stages  until  termination.  Deriving  asymptotic  time  bounds  on  the  overall  running 
time  of  the  algorithm  would  be  useful. 

7.  Acknowledgments 

We  would  like  to  thank  Peter  Rousseeuw  for  providing  us  with  the  DPOSS  data  set.  Also, 
we  are  very  grateful  to  the  anonymous  referees  for  their  informative  and  helpful  comments. 

References 

[1]  P.  J.  Rousseeuw,  A.  M.  Leroy,  Robust  Regression  and  Outlier  Detection,  Wiley,  New  York,  1987. 

[2]  P.  J.  Rousseeuw,  Least  median-of-squares  regression.  Journal  of  the  American  Statistical  Association  79  (1984) 
871-880. 

[3]  D.  L.  Souvaine,  J.  M.  Steele,  Time-  and  space-  efficient  algorithms  for  least  median  of  squares  regression.  Journal 
of  the  American  Statistical  Association  82  (1987)  794-801. 

[4]  H.  Edelsbrunner,  D.  L.  Souvaine,  Computing  median-of-squares  regression  lines  and  guided  topological  sweep. 
Journal  of  the  American  Statistical  Association  85  (1990)  1 15-1 19. 

[5]  D.  M.  Mount,  N.  S.  Netanyahu,  C.  Piatko,  R.  Silverman,  A.  Y.  Wu,  Quantile  approximation  for  robust  statistical 
estimation  and  ^-enclosing  problems,  International  Journal  of  Computational  Geometry  &  Applications  10  (2000) 
593-608. 

[6]  D.  M.  Mount,  N.  S.  Netanyahu,  E.  Zuck,  Analyzing  the  number  of  samples  required  for  an  approximate  Monte- 
Carlo  LMS  line  estimator,  in:  M.  Hubert,  G.  Pison,  A.  Struyf,  S.  V.  Aelst  (Eds.),  Theory  and  Applications  of  Recent 
Robust  Methods,  Statistics  for  Industry  and  Technology,  Birkhauser,  Basel,  2004,  pp.  207-219. 

[7]  D.  M.  Mount,  N.  S.  Netanyahu,  K.  R.  Romanik,  R.  Silverman,  A.  Y.  Wu,  A  practical  approximation  algorithm  for 
the  LMS  line  estimator.  Computational  Statistics  &  Data  Analysis  51  (2007)  2461-2486. 

[8]  T.  Bemholt,  Computing  the  least  median  of  squares  estimator  in  time  O(n^),  in:  Proc.  Inti.  Conf.  on  Computational 
Science  and  Its  Applications,  Vol.  3480  of  Springer  LNCS,  Springer- Verlag,  Berlin,  2005,  pp.  697-706. 

[9]  J.  Erickson,  S.  Har-Peled,  D.  M.  Mount,  On  the  least  median  square  problem,  Discrete  &  Computational  Geometry 
36  (2006) 593-607. 

[10]  J.  Agullo,  New  algorithms  for  computing  the  least  trimmed  squares  regression  estimator.  Computational  Statistics 
&  Data  Analysis  36  (2001)  425^39. 

[11]  M.  Hofmann,  E.  J.  Kontoghiorghes,  Matrix  strategies  for  computing  the  least  trimmed  squares  estimation  of  the 
general  linear  and  sur  models.  Computational  Statistics  &  Data  Analysis  54  (2010)  3392-3403. 

[12]  M.  Hofmann,  C.  Gatu,  E.  J.  Kontoghiorghes,  An  exact  least  trimmed  squares  algorithm  for  a  range  of  coverage 
values,  J.  Comput.  and  Graph.  Stat.  19  (2010)  191-204. 

[13]  O.  Hossjer,  Exact  computation  of  the  least  trimmed  squares  estimate  in  simple  linear  regression.  Computational 
Statistics  &  Data  Analysis  19  (1995)  265-282. 


28 


[14]  D.  M.  Mount,  N.  S.  Netanyahu,  C.  Piatko,  R.  Silverman,  A.  Y.  Wu,  On  the  least  trimmed  squares  estimator, 
Algorithmica  69  (2014)  148-183. 

[15]  P.  J.  Rousseeuw,  K.  van  Driessen,  Computing  LTS  regression  for  large  data  sets,  Data  Mining  and  Knowledge 
Discovery  12  (2006)  29^5. 

[16]  F.  Torti,  D.  Perrotta,  A.  C.  Atkinson,  M.  Riani,  Benchmark  testing  of  algorithms  for  very  robust  regression:  FS, 
LMS  and  LTS,  Computational  Statistics  &  Data  Analysis  56  (2012)  2501-2512. 

[17]  D.  M.  Hawkins,  The  feasible  solution  algorithm  for  least  trimmed  squares  regression.  Computational  Statistics  & 
Data  Analysis  17  (1994)  185-196. 

[18]  P.  Diaconis,  M.  Shahshahani,  The  subgroup  algorithm  for  generating  uniform  random  variables,  Prob.  in  the  Engr. 
and  Inf.  Sci.  1  (1987)  15-32. 

[19]  S.  H.  Friedberg,  A.  J.  Insel,  L.  E.  Spence,  Linear  Algebra,  4th  Edition,  Prentice  Hall,  2002. 

[20]  G.  H.  Golub,  C.  F.  V.  Loan,  Matrix  Computations,  The  Johns  Hopkins  University  Press,  Baltimore,  Maryland, 
1996. 

[21]  P.  J.  Rousseeuw,  A  diagnostic  plot  for  regression  outliers  and  leverage  points,  Computational  Statistics  &  Data 
Analysis  11  (1991)  127-129. 

[22]  D.  P.  Huttenlocher,  W.  J.  Rucklidge,  A  multi-resolution  technique  for  comparing  images  using  the  Hausdortf  dis¬ 
tance,  in:  Proceedings  of  the  IEEE  Conference  on  Computer  Vision  and  Pattern  Recognition,  New  York,  1993,  pp. 
705-706. 

[23]  W.  J.  Rucklidge,  Efficient  visual  recognition  using  the  Hausdortf  distance,  no.  1 173  in  Lecture  Notes  in  Computer 
Science,  Springer- Verlag,  Berlin,  1996. 

[24]  W.  J.  Rucklidge,  Efficiently  locating  objects  using  the  Hausdortf  distance.  International  Journal  of  Computer  Vision 
24(1997)  251-270. 

[25]  M.  Hagedoorn,  R.  C.  Veltkamp,  Reliable  and  efficient  pattern  matching  using  an  affine  invariant  metric.  Interna¬ 
tional  Journal  of  Computer  Vision  3 1  (1999)  103-115. 

[26]  D.  M.  Mount,  N.  S.  Netanyahu,  J.  Le  Moigne,  Efficient  algorithms  for  robust  point  pattern  matching.  Pattern 
Recognition  32  (1999)  17-38. 

[27]  T.  H.  Cormen,  C.  E.  Leiserson,  R.  L.  Rivest,  C.  Stein,  Introduction  to  Algorithms,  3rd  Edition,  MIT  Press,  Cam¬ 
bridge,  MA,  2009. 

[28]  L.  Jaulin,  M.  Kieffer,  O.  Didrit,  E.  Walter,  Applied  Interval  Analysis,  Springer- Verlag,  2001. 

[29]  S.  G.  Djorgovski,  R.  R.  Gal,  S.  C.  Odewahn,  de  R.  R.  Carvalho,  R.  Brunner,  G.  Longo,  R.  Scaramella,  The  Palomar 
digital  sky  survey  (DPOSS),  in:  Wide  Field  Surveys  in  Cosmology  (14th  lAP  meeting).  Editions  Frontieres,  Paris, 
1998,  p.  89. 


Appendix  A.  Proofs  of  Technical  Lemmas 


Proof  of  Lemma  2.  By  definition  of  the  LTS  cost  we  have 

m  jeH 


Let  H’  denote  the  indices  of  the  h  points  of  P  whose  squared  residuals  with  respect  to  p'  are  the 
smallest.  Therefore, 


jeH'  jeH'  jeH 


Thus,  we  have 


(A--A-)Vir:T  < 


We  may  interpret  A'  -  1  as  the  Euclidean  distance  between  two  points  in  particularly 
(yi, . . .  ,yh)  and  (fi'x\, . . .  ,P'xh)-  Similarly,  we  may  interpret  A*  y/h-  1  as  the  Euclidean  distance 
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between  (yi, . . .  ,yh)  and  {p* X\, . .  .p*Xh).  Let  ||u  -  v||  denote  the  Euclidean  distance  between  u 
and  V.  By  the  triangle  inequality  and  the  above  inequalities  regarding  norms,  we  have 


(A'  -  A*)  V/z-  1  < 


< 


< 


where  the  second  to  last  step  follows  from  the  Cauchy-Schwarz  inequality. 

By  the  earlier  observations  relating  j8'  and  p*  to  and  y*^,  and  another  application  of  the 
Cauchy-Schwarz  inequality  we  obtain 


(A'  -  A*)  V/z-  1  < 
< 


WyE^E 


For  each  Xj  e  P,  let  y*  =  P*Xj.  We  can  express  A*  as  YijeHiyj  ~  y*T^  from  which  it  follows 


that 


(A'  -  A*) 


s  Iiv.  -  Jtii  ■  IK'II  (2,,.  iwif  -  ytT  ■ 


By  multiplying  and  dividing  by  HA'^H,  we  obtain 

l|2  \l/2 


—  -1  < 
A* 


li  jeH^j  -  y*j)^  ) 


■  (II^£‘II-II^£||)  ■ 


IjjeHWXjf 


which  completes  the  proof. 


Proof  of  Lemma  4.  Observe  that  (by  our  general-position  assumption)  i  -  1  intervals  of  B  lie 
strictly  to  the  left  of  b^.y  and  n  -  (i  +  h  -  1)  intervals  lie  strictly  to  the  right  of  If 

^li]  ^  ^[i+h-1]’  ^  denote  the  number  of  intervals  of  B  that  contain  the  span  [by^ij_iy  b^^].  Every 

interval  of  B  is  either  among  the  i  -  1  intervals  to  the  left  of  b^.y  and/or  the  n-(i  +  h-l)  intervals 
that  lie  to  the  right  of  by^/^  ^y  or  the  k  that  contain  [by^ij_^y  by^].  Therefore, 

n  <  (i  -  1)  +  (n  -  (i  +  h  -  IJ)  +  k  <  n  -  h  +  k, 

from  which  we  conclude  that  k  >  h.  Thus,  every  point  of  the  span  [by+^^^^yb^^^  is  contained 
within  h  intervals  of  B.  This  implies  that  the  distance  from  any  such  point  to  all  of  its  h  closest 
intervals  is  zero.  Therefore,  A(Bi,h)  -  0,  and  so  A(B,h)  =  0. 

On  the  other  hand,  if  jp  then  the  i  -  1  intervals  of  B  to  the  left  of  b^^  are  disjoint 

from  the  n  -  (i  +  h  -  1)  intervals  to  the  right  of  by^f^  ^y  Since  the  remaining  n  intervals  of  B  are 
in  Bi,  it  follows  that  |B,j  -  n  -  (i  -  1)  -  (n  -  (i  +  h  -  1))  =  h,  as  desired. 
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Proof  of  Lemma  5.  Let  B'  denote  the  subset  of  B  of  size  h  that  achieves  the  minimum  LTS  cost. 
Suppose  towards  a  contradiction  that  A(B',h)  <  min,  A(B,,  h).  Let  and  denote  the  leftmost 
right  endpoint  and  rightmost  left  endpoint  in  B',  respectively.  Since  the  span  from  to  by-^  must 
overlap  at  least  h  intervals  of  B,  by  the  same  reasoning  as  in  Lemma  4,  we  have  j  >  i  +  h  -  1. 
Further,  we  may  assume  that  j  >  i  +  h-l,foi  otherwise  B'  would  be  equal  to  B,.  Thus,  B'  spans 
at  least  h  +  1  intervals,  which  implies  that  there  exists  at  least  one  “unused”  interval  of  B  that  is 
in  Bi  but  not  in  B' .  Let  I  denote  any  such  interval  (see  Fig.  A. 19(a)). 


/i  =  5 

rn  e  S' 
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1  e  s" 


add 
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Figure  A. 19:  Proof  of  Lemma  5.  By  replacing  the  interval  whose  left  endpoint  is  anchored  to  b  [j]  with  the  unused 
interval  7  anchored  to  b,  we  decrease  the  LTS  cost. 


First,  we  assert  that  b^j^  <  byy  To  see  why,  observe  that  otherwise,  b^^  is  contained  within 
h  +  1  intervals  (by  the  same  reasoning  as  in  Lemma  4),  implying  that  A(B,,  h)  -  0,  which  would 
violate  our  hypothesis  that  none  of  the  B,  ’s  is  optimal.  Given  this  assertion,  let  p  denote  the  point 
that  minimizes  the  sum  of  squared  distances  to  the  intervals  of  B'  (see  Fig.  A.  19(a)).  Clearly, 
by^  <  /r  <  byy  By  left-right  symmetry,  we  may  assume  without  loss  of  generality  that  p  is  closer 
to  b^j^  than  it  is  to  byy  (More  specifically,  if  we  negate  all  the  interval  endpoints,  we  may  apply 
the  proof  with  the  roles  of  b'^.^  and  by-^  reversed.)  Because  by-^  is  the  rightmost  left  endpoint  in 
B',  the  distance  from  jj.  to  by^  is  the  largest  among  all  the  intervals  of  B  whose  closest  endpoint 
to  fi  lies  in  the  span  [b^y  by^,  and  in  particular  this  is  true  for  the  interval  I.  Consider  the  set  B” 
formed  by  taking  B'  and  replacing  the  interval  anchored  at  by^  with  I  (see  Fig.  A. 19(b)).  Since  I 
is  closer  to  p  than  by^  is,  the  LTS  cost  of  B”  is  smaller  than  the  LTS  cost  of  B' .  This  contradicts 
our  hypothesis  that  B'  achieves  the  minimum  LTS  cost. 


Appendix  B.  Complete  Algorithm  for  the  Interval  LTS  Problem 

In  this  appendix  we  present  the  complete  algorithm  for  the  interval  LTS  problem.  The  input 
consists  of  a  set  B  of  n  nonempty,  closed  intervals  on  the  real  line  and  a  coverage  value  h.  The 
objective  is  to  determine  the  point  p  e  M  that  minimizes  the  average  squared  distances  to  the 
closest  h  intervals  and  the  cost  function  associated  with  this  point. 

Recall  from  Section  4  the  definition  of  the  cost  function  A(B,  h)  for  a  given  set  B  of  intervals. 
Also  recall  the  interval  sets  B,  and  the  /z-spans  /,,  for  1  <  i  <  n  -  h  +  1,  defined  just  prior  to 
Lemma  4.  Let  fii  denote  the  point  that  minimizes  the  sum  of  squared  distances  from  itself  to 
all  of  the  intervals  of  B,.  Our  approach  will  be  to  compute  the  initial  values,  pi  and  A{B\,h), 
explicitly  by  brute  force  in  0(h)  time.  Then,  for  i  -  2, . . .  ,n  -  h+  l,we  will  show  how  to  update 
the  subsequent  values  ju,  and  A(B,,  h),  each  in  constant  time.  By  Lemma  5,  the  jui  yielding  the 
smallest  such  cost  is  the  desired  result. 

In  the  upper  bound  algorithm  of  Section  4,  we  observed  that  the  value  that  minimizes  the 
sum  of  squared  distances  to  a  set  of  numbers  is  just  the  mean  of  the  set.  However,  generalizing 
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this  simple  observations  to  a  set  of  intervals  is  less  obvious.  Given  an  arbitrary  interval  [b^ ,b^'\ 
of  B,  the  function  that  maps  any  point  x  on  the  real  line  to  its  squared  distance  to  this  interval 
is  clearly  convex.  (It  is  not  strictly  convex  because  the  function  is  zero  for  all  points  that  lie 
within  the  interval.)  By  the  remarks  made  after  Lemma  4,  we  may  assume  that  no  h  intervals 
of  B  contain  a  common  point  (for  otherwise  the  cost  is  trivially  zero).  Since  A(B,,  h)  is  the 
minimum  of  the  sum  of  h  such  functions,  and  by  our  assumption  that  no  h  intervals  of  B  contain 
in  a  single  point,  it  follows  that  this  function  is  strictly  convex.  Therefore,  /r,  is  this  function’s 
unique  local  minimum.  Given  an  initial  estimate  on  the  location  of  /r,,  we  can  exploit  convexity 
to  determine  the  direction  in  which  to  move.  As  the  algorithm  transitions  from  to  B,,  we 
remove  one  interval  from  the  left  (the  one  anchored  at  and  add  one  interval  to  the  right 

(the  one  anchored  at  Since  we  remove  an  interval  from  the  left  end  and  add  one  on  the 

right  end,  it  follows  that  ju,  >  Thus,  the  search  for  ju,  proceeds  monotonically  from  left  to 
right. 

To  make  this  more  formal,  let  {hi, . . .  ,b2n}  denote  the  sorted  sequence  of  the  interval  end¬ 
points  of  B  in  nondecreasing  order.  This  implicitly  defines  2n  -  1  intervals  of  the  form  [bic-i,bk], 
for  1  <  k  <  2n  where  ju,  might  lie.  We  refer  to  these  intervals  as  segments  (see  Fig.  B.20(a)). 
All  the  points  within  the  interior  of  any  given  segment  have  the  property  that  they  lie  to  the 
left/right/contained  within  the  same  subset  of  intervals  of  B.  Our  algorithm  operates  by  main¬ 
taining  three  indices,  i,  j,  and  k.  We  will  set  j  -  i  +  h  -  \  so  that  /,  =  current 

h-span  of  interest.  The  index  k  specifies  the  current  segment,  [bi;,bk+i],  which  represents  the 
hypothesized  segment  that  contains  ju,.  The  algorithm  repeatedly  increases  k  as  long  as  //,  is 
determined  to  lie  to  the  right  of  bk+i- 
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Figure  B.20:  Segments  and  contributions:  (a)  The  sorted  endpoints  {b[,. . ^2«)  and  the  associated  segments  (shaded), 
(b)  the  mean  fij  and  the  distance  contributions  from  the  intervals  of  Bi. 

How  do  we  determine  whether  the  current  segment  contains  /r,?  First,  we  classify  the  every 
interval  of  B,  as  lying  to  the  left,  right,  or  containing  the  segment’s  interior.  The  intervals  of  B, 
that  contain  the  segment’s  interior  contribute  zero  to  the  squared  distance,  and  so  they  may  be 
ignored.  For  those  intervals  that  lie  entirely  to  the  segment’s  left  (resp.,  right),  the  interval’s  right 
(resp.,  left)  endpoint  determines  the  squared  distance  (see  Fig.  B. 20(b)).  Let  us  call  this  endpoint 
the  relevant  endpoint  of  the  associated  interval.  As  in  the  computation  of  the  upper  bound,  we 
will  maintain  two  quantities  sum  and  ssq,  which  will  be  updated  throughout  the  algorithm.  The 
first  stores  the  sum  of  the  relevant  endpoints  and  the  second  stores  the  sum  of  squares  of  the 
relevant  endpoints.  As  in  Eq.  (4),  the  value  /i,  and  the  LTS  cost  A(B,,  h)  -  cri  will  be  derived 
from  these  quantities. 

Each  time  we  advance  the  current  segment  by  incrementing  k,  we  are  now  either  entering  an 
interval  of  B  (if  bk  is  a  left  endpoint)  or  leaving  an  interval  (if  bk  is  a  right  endpoint).  In  the  first 
case  the  interval  we  are  entering  was  contributing  its  left  endpoint  to  (sum  and  ssq)  and  is  now 
making  no  contribution  (see  Eig.  B.21(a)).  In  the  second  case,  the  interval  we  are  leaving  was 
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making  no  contribution  and  is  now  contributing  its  right  endpoint  (see  Fig.  B.21(b)). 
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Figure  B.21:  Contribution  updates  when  advancing  the  current  segment. 


We  can  now  present  the  complete  algorithm.  Throughout,  the  algorithm  maintains  Abest, 
which  stores  the  minimum  LTS  cost  of  any  encountered  thus  far. 

(1)  [Initializations:] 

(a)  [Sort:]  Sort  the  left  endpoints  of  B,  denoted  the  right  endpoints,  de¬ 

noted  {b^iy  ■  ■  ■  and  all  the  endpoints,  denoted  (hi, . . .  ,b2n)- 

(b)  [Initialize  costs:]  Set  sum  <-  ssq  <-  and  Abest  <-  +o°- 

(c)  [Initialize  indices:]  Let  /  <—  k  <—  1  and  j  <—  h. 

(2)  [Main  loop:]  While  i  <  n  -  h  -  I  do: 

(a)  [Check  for  h-fold  interval  overlap:]  If  b^{i\  >  b~{J\,  set  Abest  <—  0  and  return  (by  the 
remarks  made  after  Lemma  4). 

(b)  [Update  yu:]  Recalling  Eq.  (4),  set  /r  <—  sum/M. 

(c)  [Find  the  segment  that  contains  ju:]  While  bk+\  <  B  do: 

(i)  [Advance  to  the  next  segment:]  Set  k  <—  k  H-  1. 

(ii)  [Entering  an  interval?]  If  b),  is  the  left  endpoint  of  some  interval  of  B,  set  sum  <— 
SUM  -  bk  and  ssq  <—  ssq  -  (h^)^  (see  Fig.  B.21  (a)) 

(iii)  [Exiting  an  interval?]  Otherwise,  bk  is  the  right  endpoint  of  some  interval  of  B. 
Set  SUM  <—  SUM  +  bk  and  ssq  <—  ssq  -h  {hkY  (see  Fig.  B. 2 1(b)) 

(iv)  [Update  /r:]  Set  fi  <—  sum/M. 

(d)  [Update  LTS  cost:]  Recalling  Eq.  (4),  set  cr  <—  .  If  cr  <  Abest  then 

set  Abest  ^  ^ • 

(e)  [Remove  contribution  of  leftmost  interval  b^y}  sum  <—  sum  -  b^^  and  ssq  <—  ssq  - 

(f)  [Advance  to  next  subset:]  Set  /  <—  /  H-  1  and  y  <—  j  +  1. 

(g)  [Add  contribution  of  rightmost  interval  h|j.j:]  sum  <—  suM-Hhj^.^  and  ssq  <—  ssQH-(h|‘.j)^. 

(3)  Return  Abest  as  the  final  LTS  cost 

The  algorithm’s  correctness  has  already  been  justified.  To  establish  the  algorithm’s  running 
time,  observe  that  Step  (la)  takes  0{n  log  n)  time,  and  Step  (lb)  takes  0(h)  time.  All  the  other 
steps  take  only  constant  time.  Each  time  the  loop  of  Step  (2c)  is  executed,  the  value  of  k  is 
increased.  Since  k  cannot  go  beyond  the  last  segment  (since  clearly  p  <  b2n),  the  total  number  of 
iterations  of  this  loop  throughout  the  entire  algorithm  is  at  most  2n.  Since  the  loop  of  Step  (2)  is 
repeated  at  most  n  -  h  +  1  times,  the  overall  running  time  is  0(n  log  n  +  h  +  2n  +  (n-h  -  1))  = 
0(n  log  n),  as  desired. 
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